ABSTRACT

Alternative mathematical models for scoring and decision making with
criterion-referenced tests are described, especially as they concern
appropriate test length and methods of establishing statistically valid
cutting scores. Several of these approaches are reviewed and compared on
formal-analytic and empirical grounds: (1) Block's approach to setting
mastery standards, student performance, and academic requirements;
(2) Crehan's classification, comparing scores of students who have and
who have not completed training; (3) the probabilistic models of
Macready, Dayton, and Emrick, which assume an equal proportion of masters
and nonmasters; (4) the binomial distribution model, which allows for
partial acquisition of proficiency; (5) the Bayesian model, which
considers prior experience; (6) Rasch's one-parameter logistic model,
which yields person-free test calibrations and item-free person
measurements; and (7) the regression approach of classical test theory,
which enables the estimation of true scores to be made from observed
scores. Examples of these approaches are given, as well as their
advantages, disadvantages, and ambiguities. (GDC)
FOREWORD

The research presented in this report was conducted under Project
METTEST (Methodological Issues in Criterion-Referenced Testing), in the
Unit Training and Evaluation Systems (UTES) Technical Area of ARI under
Army RDTE Project 2Q62722A764. The goal of Project METTEST is to provide
quantitative methods for evaluating unit proficiency. The means for
achieving this goal include basic research in test construction
methodology, measurement and scaling models, and decisionmaking
implications of test score interpretation.

Related, ongoing programs within the UTES Technical Area include
evaluation of small combat units under simulated battlefield conditions
(REALTRAIN, ARTEP), qualification of tank crews and platoon gunnery
(IDOC), and improvement of the reliability of ARTEP evaluation.

Anticipated future research under Project METTEST includes the
development of a computer model for performance evaluation, and
development of measurement, scaling, scoring, decisionmaking, and quality
control models for use in performance evaluations when
criterion-referenced testing procedures are employed.

ARI research in this area is conducted as an in-house research effort
augmented by contracts with organizations selected as having unique
capabilities and facilities for research in a specific area. The present
study was conducted in collaboration with personnel of the University of
Maryland under Contract No. DAHC19-75-M-0003.





CRITERION-REFERENCED TESTING: A CRITICAL ANALYSIS OF SELECTED MODELS

BRIEF

Requirement:

To develop a theoretical base for research and eventual application of
methods for assigning pass-fail scores in personnel and unit evaluation
using the criterion-referenced testing approach.

Procedure:

Relevant literature for each of five approaches to criterion-referenced
testing was reviewed. The approaches were compared on the basis of the
following: assumptions and rationale, the interactive effects of test
length and passing criteria on classification accuracy, and areas of
applicability. A computational example was prepared for each model, and
strengths and weaknesses were also evaluated.

Findings:

Four of the five models were able to specify an "optimal" test length
and cutoff score, although they differed as to the parameter estimates
required from the test developer. For example, expert "prior"
information can be used to reduce test length. Each of the models also
provides an estimate for misclassifications, or Type I and Type II
errors. The models are neither redundant nor interchangeable. No "best"
method was identified. Rather, the selection of a model depends upon the
particular measurement requirements and constraints as identified by the
test developer.

Utilization of findings:

This research provides qualitative and quantitative guidelines for
developers of criterion-referenced tests. The models have been applied
to analyze data from the handgun qualification course at the U.S. Army
Military Police School. Application of the models has also been
addressed to revision of Table VIII tank gunnery.
CRITERION-REFERENCED TESTING: A CRITICAL
ANALYSIS OF SELECTED MODELS

INTRODUCTION

Scoring and decisionmaking models for criterion-referenced testing deal
with two questions of practical and theoretical importance: (1) how much
test information should be collected to provide a basis for confident
decisions about the mastery or nonmastery of trained skills; and (2) what
are the methods of establishing statistically valid standards of
achievement. Criterion-referenced testing (CRT) requires that the data
provide information about performance capabilities measured against some
external criterion (Glaser & Nitko, 1971; Carver, 1974). Such criteria
are properly derived from an analysis of the requirements for performing
specific tasks successfully.

Measurement of mastery implies that CRT's should represent the skill to
be measured with high fidelity. However, serious constraints are imposed
by requiring high fidelity: (1) the time needed to administer the test
may be more than is readily available; (2) the number of examiners needed
to administer the test and collect data may be excessive; (3) the
expenditure of materials used in testing may be prohibitively high; and
(4) the appropriate testing materials or apparatus may not be available
for a long enough time. These constraints place a premium upon limiting
test data to the minimum amount sufficient for the desired quality of
decisionmaking. Statistical models offer one means of accomplishing this
goal.

Two problems arise in establishing achievement standards on CRT's. The
first is related to the congruence between CRT performance and
real-world requirements. The second is related to the statistical
inferences applied to observed CRT scores.

Before any statistical model can be used in a CRT situation, the
requirements for mastery over the domain in general must be specified.
The requirements usually describe the capabilities of persons who can
successfully perform the tasks included in the domain. Glaser and Klaus
(1963) suggest that "proficiency standards can be established at any
value between the point where the system will not perform at all and the
point where any further contribution from the human component will not
yield any increase in system performance" (p. 424).

These system requirements may include the human performance com- 
ponents of industrial- vocational tasks, minimal competencies in an 
educational system, or basic literacy skills. System requirements 
may also reflect manpower needs, the criticality of the task, or the 
consequences of poor performance. Such idealized standards must then 
be converted to standards on a particular CRT. The conversion process
involves issues of test validity which are beyond the scope of this
paper. Meskauskas (1976) discusses several methods that have been used
to bridge the gap between operational tests and real-world requirements.

If the CRT includes the entire full fidelity task, such as disassembling
and cleaning a particular piece of machinery, then setting mastery
standards is relatively clear and unambiguous. However, if the CRT
includes only a sample of the full fidelity task, or if fidelity is
decreased for practical purposes, the mastery standards for the CRT are
not clearcut. Heretofore, the use of arbitrary cutoff scores has kept
this problem at a manageable level. For example, objectives usually
include a statement of standards requiring a certain minimum percent
correct for attainment of mastery status. Two criticisms can be directed
at this concept of mastery.

First, any percentage correct is a relative standard. The definition of
mastery has been shown (Millman, 1972; Novick & Lewis, 1974; Epstein &
Steinheiser, 1975) to be a function both of the percentage correct and
of the number of trials or items that comprise the test. A more
comprehensive definition could be based either upon (1) an idealization,
such as the proportion of correct answers of all possible test items, or
(2) the position on an underlying continuum of ability hypothesized to
score an examinee on a given test. By stating standards in terms of such
an idealization or ability continuum, it is possible to explicitly
define mastery cutoff scores for any test length.
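
The dependence of a percentage standard on test length can be sketched
numerically. The short Python example below is an illustration only, not
a procedure taken from the works cited: it assumes a binomial response
model, and the function name, the true proportion of .70, and the 80%
standard are hypothetical values chosen for the example. It computes the
chance that an examinee whose true proportion correct is .70 nevertheless
reaches an 80% standard on tests of different lengths.

```python
from math import ceil, comb


def prob_reaching_cutoff(true_p: float, n_items: int, pct_required: float) -> float:
    """P(observed score >= required count), assuming a binomial model.

    true_p       -- assumed true proportion correct over the item universe
    n_items      -- number of items on the test
    pct_required -- percentage standard expressed as a proportion (e.g. 0.80)
    """
    # Smallest whole number of items that satisfies the percentage standard;
    # the small offset guards against floating-point round-up.
    k_min = ceil(pct_required * n_items - 1e-9)
    return sum(
        comb(n_items, k) * true_p**k * (1 - true_p) ** (n_items - k)
        for k in range(k_min, n_items + 1)
    )


if __name__ == "__main__":
    # Hypothetical examinee (true proportion .70) against an 80% standard.
    for n in (5, 10, 20, 40):
        print(n, round(prob_reaching_cutoff(0.70, n, 0.80), 3))
```

Under these assumptions the same 80% standard is passed far more often
on a 5-item test than on a 40-item test, which is the sense in which a
percentage correct is a relative standard.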

The second criticism refers to the level of ability required for
mastery. For example, why should one standard (such as 80% correct)
be set rather than another (such as 70% or 90%)? Perhaps this question
could be answered by empirical studies showing the relationship between 
CRT scores and the transfer or retention of training. The required 
level of mastery could also be determined by system requirements, criti- 
cality, and similar factors. 

Each of the models discussed in this paper, with the exception of
Block's (1972) approach to setting standards empirically, assumes that
a well-defined universe of items exists or can be generated. The authors
also assume that the role of the statistical model is to describe
accurately an examinee with respect to that universe. The validity of
the generalization from the universe of items to the real world is not
investigated. The models further assume that a mastery standard relative
to the entire universe can be established. Given these assumptions, the
problem is how to interpret the observations. The following section
discusses theoretical issues which may produce possible solutions. Table
1 then introduces and summarizes the specific models.

The problem of setting standards arises because it is often impractical
to insist upon complete mastery of a task, or even to require a very
high percentage of correct answers to the items comprising a CRT.
Furthermore, it is often impossible to list all of the potential items
of a given task domain. For example, an indefinitely large number of
multiplication items could comprise an item universe from which a sample
of items is selected. An arbitrary standard would determine that the
examinee answering a specified number (or percentage) of the sample
correctly will be classified as a "master" of multiplication. The main
purpose of the present paper is to evaluate several mathematical models
that claim to reduce the arbitrariness in setting criteria for mastery
on tests representing a sample of the test-item universe. The motivation
for developing models by which criteria for mastery can be derived
formally arises from the goal of trying to minimize misclassifications
(i.e., designating a "true master" as a "nonmaster" or vice versa). The
more complex the skills assessed by the CRT, the smaller the sample of
items, and the more varied the type of performance included in the
universe, the greater the danger of misclassification.

Table 1

Summary Comparison of Some Methods and Models Used in
Criterion-Referenced Testing

(x = observed score, n = number of items, A = true ability)

Block
    Nature of performance acquisition:  Undefined
    Theoretical observed score:         Undefined
    True score distribution:            Undefined
    Cutoff score specification:         Empirical, based upon external
                                        criterion

Crehan
    Nature of performance acquisition:  Undefined
    Theoretical observed score:         Pre-instruction: x = 0
                                        Post-instruction: x = n
    True score distribution:            Dichotomous, based on pre-post
                                        instruction
    Cutoff score specification:         Empirical; pre-post instruction
                                        classification

Emrick
    Nature of performance acquisition:  All-or-none
    Theoretical observed score:         Nonmaster: x = 0
                                        Master: x = n
    True score distribution:            Dichotomous, master or nonmaster
    Cutoff score specification:         Choose score that best
                                        dichotomizes observed score
                                        distribution, assuming guessing
                                        and forgetting errors.

Dayton & Macready
    Nature of performance acquisition:  All-or-none
    Theoretical observed score:         Nonmaster: x = 0
                                        Master: x = n
    True score distribution:            Dichotomous, master or nonmaster
    Cutoff score specification:         Choose score that best
                                        dichotomizes observed score
                                        distribution, assuming guessing
                                        and forgetting errors.

Kriewall-Millman (Binomial)
    Nature of performance acquisition:  Continuous
    Theoretical observed score:         p(x|A) = [formula illegible in
                                        source]
    True score distribution:            Undefined
    Cutoff score specification:         Choose score such that the sum of
                                        the probability of achieving at
                                        least that score for nonmasters,
                                        and of not achieving that score
                                        for masters, is minimized.

Novick et al. (Bayesian)
    Nature of performance acquisition:  Continuous
    Theoretical observed score:         p(x|A) = [formula illegible in
                                        source]
    True score distribution:            Beta-binomial
    Cutoff score specification:         Calculate posterior probability
                                        that observed score exceeds the
                                        standard.

Rasch (logistic)
    Nature of performance acquisition:  Continuous
    Theoretical observed score:         $p(x|A) = \prod_{i=1}^{n} \frac{e^{(b_i - A)}}{1 + e^{(b_i - A)}}$,
                                        where $b_i$ = item difficulty
    True score distribution:            Normal
    Cutoff score specification:         Choose minimum Rasch ability
                                        estimate. Calculate the ability
                                        estimate from observed score.

Classical regression
    Nature of performance acquisition:  Continuous
    Theoretical observed score:         x = A + e, where e = error of
                                        measurement
    True score distribution:            Normal
    Cutoff score specification:         Choose minimum "true" score
                                        criterion. Calculate estimated
                                        true score from observed score.


Theoretical Problems for CRT Models 

Nature of Performance Acquisition. Is the attainment of mastery an
"all-or-none" occurrence, or is there a continuum of varying degrees of
skill acquisition? The widely accepted dichotomy of master vs. nonmaster
may be overly simplistic. The alternative is a continuum of varying
degrees of mastery. Both dichotomous and continuous CRT models are
available in the literature.

Measurement Error. One type of error, similar to the classical
psychometric notion of measurement error, refers to random inappropriate
responses due to temporary environmental distractions, lucky guesses,
lapses in attention, etc. The magnitude of such error can be estimated
and included in the estimation of actual ability and in the
determination of test standards and lengths.

A second type, "classification" error, refers to the (usually)
dichotomous classification of an examinee as a master or nonmaster. Its
magnitude and direction are primarily a function of how a cutoff score
is chosen. Classification error will tend to increase as the accuracy in
estimating actual ability decreases, but a mathematically defined
relationship between measurement error and classification error has not
been derived (Guilford, 1956, pp. 380-384).

Test Length to Distinguish Masters from Nonmasters. One technique to
improve ability estimation and reduce the chance for misclassification
is to increase the number of test items. In some situations this may be
possible simply by repeating items until the desired level of precision
is attained. However, in most cases, test length cannot be indefinitely
increased. Therefore, a statistical model that provides increased
information per item is highly desirable. Generally, a CRT model should
provide sufficient information to decisionmakers so that they will know
the risks of committing false positive and false negative errors before
the test is conducted.
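
The kind of advance risk statement described above can be sketched as
follows. This is a minimal illustration under stated assumptions, not
one of the reviewed models as published: responses are assumed binomial,
and the true proportions correct for a typical nonmaster (.50) and a
typical master (.90), like the function names, are hypothetical values
introduced for the example. For a given test length and cutoff, the code
returns the false positive and false negative probabilities, and it
searches for the cutoff that minimizes their sum.

```python
from math import comb


def binom_tail(n: int, k_min: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def error_rates(n: int, cutoff: int, p_nonmaster: float, p_master: float):
    """Return (false positive, false negative) probabilities for one cutoff.

    An examinee passes when the observed score is >= cutoff.
    """
    false_pos = binom_tail(n, cutoff, p_nonmaster)      # nonmaster passes
    false_neg = 1.0 - binom_tail(n, cutoff, p_master)   # master fails
    return false_pos, false_neg


def best_cutoff(n: int, p_nonmaster: float, p_master: float) -> int:
    """Cutoff (0..n+1) minimizing the sum of the two error probabilities."""
    return min(
        range(n + 2),
        key=lambda c: sum(error_rates(n, c, p_nonmaster, p_master)),
    )


if __name__ == "__main__":
    # Hypothetical true proportions: nonmaster .50, master .90.
    for n in (5, 10, 20):
        c = best_cutoff(n, 0.50, 0.90)
        fp, fn = error_rates(n, c, 0.50, 0.90)
        print(f"n={n:2d} cutoff={c:2d} false_pos={fp:.3f} false_neg={fn:.3f}")
```

Because both error probabilities are available before any examinee is
tested, a decisionmaker can compare candidate test lengths against the
testing resources actually available.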
Overview of Selected CRT Models

The CRT models discussed in this paper were chosen to illustrate the
diversity in approaches to the problems outlined in the preceding
section. Methods developed by Crehan (1974) and Block (1972) are
basically empirical in that cutoff scores are based upon empirically
derived requirements. Models derived by Emrick (1971) and by Macready
and Dayton (1976) assume a dichotomous definition of mastery and
analytically describe procedures for establishing cutoff scores.
Kriewall (1969) and Millman (1972, 1974) assume that responses to test
items and examinee ability can be described by the family of binomial
distributions. Their basic model can be extended by applying the theory
of binomial error models (Lord & Novick, 1968). Novick and Lewis (1974)
discuss the application of a Bayesian approach to CRT issues. A
one-parameter logistic model (Rasch, 1960; Wright, 1967) provides a
practical example of how latent trait theory may be applied to CRT data
analysis. Finally, an approach for CRT data analysis derived from
classical regression theory is discussed. Each model is examined in
terms of rationale and assumptions, empirical support and applications,
illustrative examples of the type of input required and output provided,
and critical evaluation.

REVIEW OF MODELS

Block

Block's (1972) research provides an experimental approach to setting
mastery standards. He studied the relationship between the level of
performance required on each unit of a three-unit instructional sequence
and five cognitive and affective outcome variables. The rationale for
this study was the intuitive notion that maximum performance on an
external measure of achievement would be observed in students having the
most stringent passing requirements in the instruction. A second
question concerned the relationship between scores on an affective
measure of interest and attitude and passing requirements in
instruction.

Block's experiment included four treatment groups that differed from one
instructional unit to the next with respect to the standard required for
advancement. If the student did not meet the standard (65%, 75%, 85%, or
95% of the items correct on a 20-item test), remedial instruction was
provided. Students in a control group proceeded from one unit to the
next with no remediation, regardless of their test score. Five outcome
variables were defined: achievement, learning rate, transfer, interest,
and attitude.

Transfer was measured by a 10-item test which required the use of the
learned skills to solve a novel set of problems. It was given both as a
pretest and after instruction. Interest and attitude were measured using
a 24-item questionnaire.
Most of the results supported the intuitive hypothesis. The control
group did consistently worse on achievement, transfer, and retention
than any of the experimental groups, and the learning curves suggested
that high standards early in an instructional sequence may produce
increased efficiency later in the sequence. However, several interesting
exceptions to the intuitive expectations suggest that higher standards
are not always better standards. For example, the 85% and 95% groups did
not differ from one another on retention or achievement measures,
although they both differed from the control group. Only the 85% group
produced sustained high levels of interest and attitude.

Block's research suggests that a unitary definition of an "optimum" CRT
cutting score may be questionable. If uniformly high achievement and
transfer are required at the possible expense of positive interest and
attitude, it may be that the highest mastery standard should be used.
However, if some "mix" of cognitive and affective outcomes is desired,
then a lower standard seems appropriate.

Similar studies could be conducted on a wide range of instructional
programs for a wide variety of outcomes. The results could lead to
usable and meaningful guidelines for setting cutting scores to optimize
a number of instructional outcomes. Because the results may not be
generalizable across content areas and instructional programs, such an
optimization strategy would require costly and extensive research. This
empirical verification of a decisionmaking strategy for finding optimal
mixes of cognitive and affective outcomes does not mathematically model
any of the problems outlined in the previous section of this paper. A
truly complete scoring and decisionmaking CRT model would take into
account both the psychological variables that characterize optimum
learning and the constraints imposed by test length, cutting scores, and
misclassification rates.


Crehan

A method used by Crehan (1974) also relies heavily on a training context
for its interpretation. The method's rationale for specifying cutting
scores is based upon the comparison of the test scores of students who
have completed training with the test scores of those who have not yet
received training. This method provides a means of assessing the
proportion of misclassified students within each group when various
cutting scores are used.

Correct classification occurs when posttraining students pass the test
and students with no training fail the test. Using a 2 x 2 matrix of
pass-fail and training-no training for each cutting score, the
proportion of correct classifications $P_c$ can be obtained as follows:

$$P_c = \frac{\text{(number who had training and passed)} + \text{(number who had no training and failed)}}{\text{sum of all four entries in the matrix}}$$

A cutting score is found by choosing the score that maximizes the
proportion of correct classifications.

For example, assume that the distribution of scores on a five-item CRT
for an untrained group and a group that has completed training is as
follows:

Number Correct    No Training    Completed Training

      0               10                  0
      1                5                  0
      2                4                  1
      3                0                  5
      4                1                 10
      5                0                  4


A series of fourfold tables in Table 2 displays the relationships
between cutting score pass-fail decisions and the amount of training.
$P_c$, the proportion of correct classifications, is calculated for each
fourfold table. The highest value of $P_c$ in this example is found when
three correct responses are used as the cutting score. Therefore, for
this training program, a cutting score of 3 would be recommended as the
optimal cutting score.
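
The calculation can be traced with a short Python sketch. The variable
and function names are illustrative only. The score frequencies are
those of the example; the frequency of zero for untrained examinees with
five correct is an assumption inferred from the fourfold tables, in
which no untrained examinee passes at a cutting score of 5. An examinee
passes when the number correct is at least the cutting score.

```python
# Frequencies of number-correct scores 0..5 on the five-item CRT.
# The final 0 in no_training is inferred from Table 2 (assumption).
no_training = [10, 5, 4, 0, 1, 0]
completed_training = [0, 0, 1, 5, 10, 4]


def proportion_correct(cut: int) -> float:
    """Proportion of correct classifications P_c for one cutting score."""
    trained_pass = sum(completed_training[cut:])   # trained and passed
    untrained_fail = sum(no_training[:cut])        # untrained and failed
    total = sum(no_training) + sum(completed_training)
    return (trained_pass + untrained_fail) / total


# P_c for every possible cutting score, and the score that maximizes it.
p_c = {cut: proportion_correct(cut) for cut in range(6)}
best = max(p_c, key=p_c.get)

if __name__ == "__main__":
    for cut, value in p_c.items():
        print(f"cutting score {cut}: P_c = {value:.3f}")
    print("recommended cutting score:", best)
```

Run on these frequencies, the sketch reproduces the six values of $P_c$
in the example (.5, .75, .875, .95, .825, .60) and selects a cutting
score of 3.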

The major strength of this procedure is that it provides an estimate of
the optimal cutting score for differentiating between trained and
untrained groups while remaining relatively simple to implement.
However, these two groups do not necessarily correspond to the
categories of "masters" and "nonmasters" in terms of the ability of
group members to complete an objective. Instead, one might expect the
posttraining group to perform less well than a group consisting entirely
of examinees who have mastered the objective, and the pretraining group
to perform somewhat better than a group of examinees, none of whom has
mastered the objective.

The simplicity of Crehan's procedure is partially offset by a number of
weaknesses, including the following: (1) lack of a procedure for
estimating the minimum item sample size necessary to keep the
probability of misclassification at or below some specified level; and
(2) lack of statistical criteria for differentiating between $P_c$'s
which "seem" to be similar (or different).

Table 2

Example Data Matrices for the Crehan Procedure

                        Training experience
Cutting                 No           Completed
score                   training     training      P_c

  0        Pass           20            20
           Fail            0             0         20/40 = .5

  1        Pass           10            20
           Fail           10             0         30/40 = .75

  2        Pass            5            20
           Fail           15             0         35/40 = .875

  3        Pass            1            19
           Fail           19             1         38/40 = .95

  4        Pass            1            14
           Fail           19             6         33/40 = .825

  5        Pass            0             4
           Fail           20            16         24/40 = .60

Macready and Dayton; Emrick

Assumptions and Rationale. Two related probabilistic models that provide
probability estimates of the $2^n$ possible response patterns on a
dichotomously scored, n-item test are discussed in this section (Emrick,
1971; Dayton & Macready, 1976; and Macready & Dayton, 1975). Both models
assume that all examinees belong to one of two possible "true score
types" for any given domain: masters ($M$) and nonmasters ($\bar{M}$).
Masters are those individuals who have acquired the necessary skills to
respond correctly to all items within the domain. Thus for a three-item
test with items sampled from the domain of interest, a master's true
score response pattern would be 111, where a "one" indicates a correct
response to an item. Conversely, nonmasters have not acquired the
necessary skills to respond correctly to any item within the domain;
thus their true score response pattern would be 000, where a "zero"
indicates an incorrect response to an item. This dichotomous
classification of individuals appears reasonable to the degree that all
items within a domain involve the same skill.

In general, it is assumed that the only way that any non-true score 
response pattern can occur is for a nonmaster to make one or. mor^ cor- 
rect "guessing" errors or for a master to make one or more forgetting 
errors.^ ^or the first model (Macready & Dayton, 1975), the error prob- 
abilities are unrestricted except for the usual 0, 1 bounds for proba- 
bilities- a^ and b^^ represent the probabilities of a "guessing" and 
"forgetting" error, respectively, for item i . Furthermore, P(M)and 
P*(M) represent the proportions of examinees who are masters and nonmas- 
ters, respectively, with the usual restrictions: 0 < P(M) < 1 and 
P(M) + P(M) =1. If local independence among responses is assumed, 
then the probability of the jth observed response pattern on an n-item 
test is 



P(j) = p(j|M)p(M) + r)(j|M)p(M) 

[ n X. . 1 - X. .1 

[' n 1 - x. . X. ."1 

n b. "3 (1 - b.) 



pCM) ^- 



P(M) r (!) 



where $x_{ij} \in \{0, 1\}$ is the score of the $i$th item for the $j$th
response pattern. Maximum likelihood estimates of these parameters are
obtained from test data by means of the Newton-Raphson iteration
procedure (Rao, 1965, pp. 302-309).

Because of the relatively large number of parameters ($2n + 1$) under
this first model, there are circumstances in which it is desirable to
utilize a second model (Dayton & Macready, 1976) based on a more
restrictive set of assumptions: "guessing" errors for all items are equal
(i.e., $a_i = a$) and "forgetting" errors for all items are equal (i.e.,
$b_i = b$). These assumptions reduce the number of parameters to be
estimated to three for tests composed of any number of items and allow
for a simplification of the formula defining the probability of the
occurrence of the $j$th response pattern on an $n$-item test to



$$
P(j) = P(j \mid \bar{M})\,P(\bar{M}) + P(j \mid M)\,P(M)
     = a^{s_j} (1 - a)^{n - s_j}\,P(\bar{M}) + b^{n - s_j} (1 - b)^{s_j}\,P(M), \qquad (2)
$$

where $s_j$ is the number of correct responses (i.e., number of 1's) in
the response pattern.

Macready and Dayton provide a discussion of how these models can be
used for making classification decisions with respect to mastery of
specific concepts or skills, and they provide several examples. The
discussion includes the development of procedures for (1) assessing the
adequacy of "fit" provided by the models, (2) identifying optimal
decision rules for mastery classification that incorporate utility
functions related to costs of false negatives and false positives, and
(3) identifying minimally sufficient numbers of items necessary to
obtain acceptable levels of misclassification.



Example. For the case of a three-item test, there are eight possible
response patterns: (000), (001), (010), (100), (110), (101), (011),
(111). For the first model, the $2n + 1$ necessary parameters correspond
to guessing ($a_i$) and forgetting ($b_i$) parameters for each item and
the proportion of subjects in the examinee group who are masters.
Maximum likelihood estimates of these parameters are obtained from the
test data.

For purposes of example for Model I, assume the following parameter
values: $a_1 = .01$, $b_1 = .20$; $a_2 = .05$, $b_2 = .10$; $a_3 = .10$,
$b_3 = .05$; and $P(M) = P(\bar{M}) = .5$. This might correspond to a
test in which the items appeared to be growing increasingly easy. For
the second model, only three parameters are found: $a$, $b$, and $P(M)$.
Again for purposes of example for Model II, assume that the obtained
estimates for the parameters are $a = .06$, $b = .12$, and
$P(M) = P(\bar{M}) = .5$.

To find the probability of observing each response pattern in a
given examinee group, the probability of observing each response pattern
given mastery status must be multiplied by the proportion of the group
in that mastery status. For this example, the conditional probability of
each response pattern must be multiplied by $P(M) = P(\bar{M}) = .5$.
Table 3 shows the results of these calculations.
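The calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pattern_probabilities` and the dictionary layout are our own assumptions, not part of the Macready and Dayton procedure, and the parameter values are the example values assumed in the text rather than estimates from data.

```python
from itertools import product

def pattern_probabilities(a, b, p_master):
    """Joint probabilities of each response pattern with each mastery state.

    a[i] is the "guessing" error and b[i] the "forgetting" error for item i
    (equation 1); p_master is P(M). Names are illustrative assumptions.
    """
    table = {}
    for pattern in product((0, 1), repeat=len(a)):
        master = p_master
        nonmaster = 1.0 - p_master
        for x, a_i, b_i in zip(pattern, a, b):
            # A master answers correctly unless a forgetting error occurs.
            master *= (1.0 - b_i) if x else b_i
            # A nonmaster answers correctly only through a guessing error.
            nonmaster *= a_i if x else (1.0 - a_i)
        table["".join(map(str, pattern))] = (master, nonmaster)
    return table

# Example values assumed in the text for Model I and Model II.
model_1 = pattern_probabilities([.01, .05, .10], [.20, .10, .05], 0.5)
model_2 = pattern_probabilities([.06] * 3, [.12] * 3, 0.5)

for pattern in sorted(model_1):
    m1, n1 = model_1[pattern]
    m2, n2 = model_2[pattern]
    print(f"{pattern}  {m1:.6f}  {n1:.6f}  {m2:.6f}  {n2:.6f}")
```

Because Model II is the special case of Model I with equal error rates across items, the same function serves both models.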

Table 3

Probability of Observing Response Patterns Under the
Macready and Dayton Models, Assuming $P(M) = P(\bar{M}) = .5$

                   Model I                     Model II
Response     P(response pattern)         P(response pattern)
pattern      Master       Nonmaster      Master       Nonmaster

000          .0005        .423225        .000864      .415292
001          .0095        .047025        .006336      .026508
010          .0045        .022275        .006336      .026508
100          .0020        .004275        .006336(c)   .026508(d)
110          .0180(a)     .000225(b)     .046464      .001692
101          .0380        .000475        .046464      .001692
011          .0855        .002475        .046464      .001692
111          .3420        .000025        .340736      .000108

             P(M) = .5    $P(\bar{M}) = .5$    P(M) = .5    $P(\bar{M}) = .5$

(a) $P(M) = (.2^0 \times .8^1)(.1^0 \times .9^1)(.05^1 \times .95^0) \times .5 = .0180$.

(b) $P(\bar{M}) = (.01^1 \times .99^0)(.05^1 \times .95^0)(.1^0 \times .9^1) \times .5 = .000225$.

(c) $P(M) = .12^2 \times .88^1 \times .5 = .006336$.

(d) $P(\bar{M}) = .06^1 \times .94^2 \times .5 = .026508$.

The mastery/nonmastery decision rule is based on the score that
minimizes the probability of misclassification. Probability of
misclassification is defined as the probability that a master will not
achieve the cutting score times the proportion of masters in the group
plus the probability that a nonmaster will equal or exceed it times the
proportion of nonmasters in the group. The probabilities for both models
and all possible cutting scores are given in Table 4.

The final column of Table 4 indicates that for both models the
optimal cutting score is 2 correct. Note that although the cutting score
is the same for both models, the probability of misclassification under
the richer Model I is smaller than under Model II at every cutting score
from 1 to 3.
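The search for the optimal cutting score can be illustrated with a short Python sketch. The function name `misclassification` and the looping scheme are assumptions made for illustration; equal losses for the two kinds of error are assumed, so the rule simply minimizes the total probability of misclassification.

```python
from itertools import product

def misclassification(a, b, p_master, cut):
    """Return (false negative, false positive) probabilities for a cutting score.

    A master is misclassified when the score falls below `cut`; a nonmaster
    is misclassified when the score equals or exceeds `cut`.
    """
    false_neg = false_pos = 0.0
    for pattern in product((0, 1), repeat=len(a)):
        master, nonmaster = p_master, 1.0 - p_master
        for x, a_i, b_i in zip(pattern, a, b):
            master *= (1.0 - b_i) if x else b_i
            nonmaster *= a_i if x else (1.0 - a_i)
        if sum(pattern) < cut:
            false_neg += master
        else:
            false_pos += nonmaster
    return false_neg, false_pos

# Example parameter values assumed in the text.
models = {
    "Model I": ([.01, .05, .10], [.20, .10, .05]),
    "Model II": ([.06] * 3, [.12] * 3),
}
best = {}
for name, (a, b) in models.items():
    # Cutting scores 0 ("all pass") through 4 ("all fail") on a three-item test.
    totals = {cut: sum(misclassification(a, b, 0.5, cut)) for cut in range(5)}
    best[name] = min(totals, key=totals.get)
    print(name, {cut: round(total, 6) for cut, total in totals.items()})
print(best)
```

With unequal losses, each error term would be weighted by its loss before the minimum is taken; that extension is not shown here.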



Emrick (1971) developed a procedure related to the restricted form
of the Macready and Dayton model. He generated a function for
identifying optimal cutoff scores in terms of relative costs of
incorrect mastery/nonmastery decisions and the ratio of a to b errors.
The optimized formula is



$$
k = \frac{\log\dfrac{b}{1 - a} + \dfrac{1}{n}\,\log\dfrac{L_2\,P(M)}{L_1\,P(\bar{M})}}
         {\log\dfrac{ab}{(1 - a)(1 - b)}}, \qquad (3)
$$

where

$k$ = proportion of items correct required for a mastery decision;

$L_1$ = loss incurred from a false positive;

$L_2$ = loss incurred from a false negative.

This cutscore value is the same as that suggested by Macready and
Dayton under their restricted model when the same parameter estimates
are used. However, Emrick suggests a different approach for parameter
estimation. He constructs a fourfold table relating true mastery state
and observed item responses to a single item, with the cell entries
being the error probabilities a and b. Emrick then treats a and b as
response contingencies and computes a phi coefficient to indicate the
correlation between observed single item responses and true mastery
state:

$$
\phi = \frac{1 - a - b}{\sqrt{1 - (a - b)^2}}. \qquad (4)
$$

He uses the average interitem correlation of examinee responses to
compute an unbiased estimate of the reliability of a single item using
the Spearman-Brown prophecy formula.

Since reliability is defined as the proportion of total variance
that is true variance, it can be interpreted as an unbiased estimate of
the squared correlation between an examinee's true mastery state and his
or her item response. Hence, item responses, true mastery state, and
error probabilities can be directly related through the test
reliability. If the ratio of a to b is known (or if it can be
estimated), values for a and b can be directly calculated.

Table 4

Probability of Misclassification as a Function of Cutting
Score Under the Macready and Dayton Models,
Assuming $P(M) = P(\bar{M}) = .5$

Cutting        P(False negative)      P(False positive)      P(Misclassification)
score          Model I    Model II    Model I    Model II    Model I    Model II

0 (all pass)   0          0           .5         .5          .5         .5
1              .0005      .000864     .076775    .084708     .077275    .085572
2              .0165(a)   .019872     .0032(b)   .005184     .0197      .025056
3              .1580      .159264     .000025    .000108     .158025    .159372
4 (all fail)   .5         .5          0          0           .5         .5

(a) The probability that a master will be misclassified when the cutoff
score is set at 2 correct equals the sum of the probabilities that a
master will get only 0 or 1 items correct times the proportion of
masters in the group. For Model I, this probability equals .0005 + .0095
+ .0045 + .002 = .0165. For Model II, .000864 + 3(.006336) = .019872.

(b) The probability that a nonmaster will be misclassified when the
cutoff score is set at 2 correct equals the sum of the probabilities
that a nonmaster will get 2 or 3 items correct times the proportion of
nonmasters in the group. For Model I, this probability equals .000025
+ .002475 + .000475 + .000225 = .0032. For Model II, .000108 +
3(.001692) = .005184.

For the Macready-Dayton model example values (a = .06, b = .12),
the value of phi is .821. Squaring this value and applying the
Spearman-Brown prophecy formula for a three-item test indicates that the
test reliability for this example would be .86. Assuming a loss ratio of
1 and equal proportions of masters and nonmasters, the value for k in
Emrick's optimization formula is .4339. This implies a cutting score of
1.3 on a three-item test, or rounding up to the next higher integer, 2.
Thus, the final result is the same as the result obtained with Macready
and Dayton.
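The arithmetic of this example can be retraced with a brief Python sketch. The function names are illustrative assumptions. The cutoff function covers only the special case used in the example (loss ratio of 1 and equal proportions of masters and nonmasters), in which the loss-and-prior term of equation (3) is the logarithm of 1 and therefore drops out.

```python
import math

def phi_coefficient(a, b):
    """Equation (4): correlation between a single item response and mastery state."""
    return (1 - a - b) / math.sqrt(1 - (a - b) ** 2)

def spearman_brown(r_single, n_items):
    """Reliability of an n-item test, given the reliability of a single item."""
    return n_items * r_single / (1 + (n_items - 1) * r_single)

def emrick_k_equal_losses(a, b):
    """Equation (3) when the loss ratio is 1 and P(M) equals P(nonmaster).

    Under those assumptions the second term of the numerator is log(1) = 0.
    """
    return math.log(b / (1 - a)) / math.log(a * b / ((1 - a) * (1 - b)))

a, b, n_items = 0.06, 0.12, 3
phi = phi_coefficient(a, b)
reliability = spearman_brown(phi ** 2, n_items)
k = emrick_k_equal_losses(a, b)
cutting_score = math.ceil(k * n_items)  # round up to the next higher integer

print(round(phi, 3), round(reliability, 2), round(k, 4), cutting_score)
```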

Evaluation. An important constraint of this approach is that the
proportion of masters and nonmasters must be equal. (The computations
for the preceding example and a more general form of the Emrick model
are presented in Appendix A.)

Other possible weaknesses in Emrick's approach to parameter
estimation are the subjectivity required and the somewhat overly
restrictive assumptions necessary to implement his approach. In
addition, the complexity of both conceptualizing and quantifying $L_1$
and $L_2$ may greatly complicate the derivation of cutoff scores under
these models.

If the assumptions are met, an optimal differentiation between
masters and nonmasters will result. Furthermore, a means is provided
to determine how many items are needed to keep the probability of
misclassification at or below some specified critical level. The
relationships among test items may also be explored. A major potential
weakness concerns the assumption that learning occurs in an
"all-or-none" manner, with no partial learning or overlearning. Failure
to satisfy this assumption could produce a poor fit of data to the
model, which will in turn produce a far less than optimal cutting score.

Binomial Model 

Assumptions and Rationale. In contrast to the all-or-none learning
assumption of the Emrick and Macready models is the assumption that
learning is a continuous process. A binomial distribution model, first
suggested and derived by Kriewall (1969) and subsequently developed by
Millman (1972), defines proficiency as the probability that a person
will correctly respond to any test item randomly chosen from a
specified domain of items. Proficiency may also be defined as the
proportion of items that would be correct if all items in the domain
could be administered. Since the proficiency value can take on values
from zero to one, the model allows for partial acquisition.

The following assumptions are pertinent: (1) dichotomously scorable
items, (2) local independence of items, (3) no systematic learning or
forgetting during test taking, and (4) items equally difficult for any
given examinee. The percentage of items answered correctly is taken as
a point estimate of the examinee's true proficiency. For a given
proficiency, the probability of observing any score may be determined.
The hypothesis to be tested in this model involves the likelihood of a
specific score, if indeed the examinee had the given level of
proficiency.

The basic equation for the binomial model yields the probability
distribution of scores for an examinee with proficiency "p" for repeated
random samples of items of size "n" from a given domain of items:

$$
f(x) = \binom{n}{x} p^{x} (1 - p)^{n - x}, \qquad (5)
$$

where

x = the total number of correct responses,

f(x) = the probability of test score x, and

$\binom{n}{x}$ = the binomial coefficient, $\dfrac{n!}{x!\,(n - x)!}$.



The binomial model can be used to provide two types of information.
First, the proportion correct is the maximum likelihood estimate of an
individual's proficiency relative to the particular domain. Second, the
model can be used to investigate the interaction between test length and
classification error when individuals are divided into two groups. One
group will contain students with proficiency greater than or equal to
some minimal proficiency criterion. The other group will have students
with proficiency levels less than or equal to some maximum nonmastery
criterion.

To calculate the expected error in decisionmaking, it is necessary
to specify two parameters. The first is the lowest proficiency level
required for an individual to be considered a master. The second is
the highest proficiency level that a student could obtain and still be
considered a nonmaster. When these values are set by the decisionmaker,
the probability of false negative and false positive errors for minimal
masters and maximal nonmasters, respectively, can be calculated for any
given test length and cutting score. This procedure, it should be noted,
is generally conservative. That is, if the group contains examinees
with abilities above minimal mastery or below maximal nonmastery, the
number of misclassifications observed will be less than that predicted
by the model.

Example. Suppose that a cutoff score of 80% correct was selected
(i.e., in order to be classified as a master, a student must get
correct at least 80% of whatever number of items are included on the
test). Assume also that a true proficiency of 90% is defined as the
minimal mastery level, and that a true proficiency of 70% is defined as
the maximal nonmastery level. The region between these two proficiency
levels is an "area of indifference." That is, if an examinee's true
proficiency lies between 70% and 90%, the decisionmaker would be
indifferent as to whether the examinee is classified as a master or as
a nonmaster.



Values for misclassification error that can be tolerated must also
be specified. Continuing with the above example, assume that the
decisionmaker is unwilling to accept more than 26% of the students whose
true ability is 70%, and he or she wants to reject not more than 19% of
those whose true ability is 90%. Thus, the maximum tolerable
probabilities of a false positive and false negative are .26 and .19,
respectively. Given these values, it is possible to determine the
minimal number of test items.

The following notation will be used:

n = the total number of test items,

c = the cutoff score (in this example c = .8n or the next highest
integer value of .8n since an 80% standard was chosen),

x = the observed score, and the formula for cumulative terms of
the binomial distribution is

$$
\sum_{x} \binom{n}{x} p^{x} (1 - p)^{n - x}. \qquad (6)
$$



Specifying that the probability of falsely rejecting a master must
not exceed .19 means that the cumulative probability of a master
obtaining a score from 0 correct to c - 1 correct must not exceed .19.
This constraint may be expressed as the inequality

$$
F(x \le c - 1) \le .19. \qquad (7)
$$

Therefore,

$$
.19 \ge \sum_{x = 0}^{c - 1} \binom{n}{x} (.9)^{x} (.1)^{n - x},
$$

where p = .9, the minimal mastery level.

A similar relationship exists for nonmasters. Since the probability
of falsely accepting a nonmaster must not exceed .26, the cumulative
probability of a nonmaster obtaining a score greater than or equal to
c must not exceed .26. The inequality for nonmasters is

$$
1 - F(x \le c - 1) \le .26. \qquad (8)
$$

Therefore,

$$
.26 \ge \sum_{x = c}^{n} \binom{n}{x} (.7)^{x} (.3)^{n - x},
$$

where p = .7, the maximal nonmastery level.

Reference to a table of cumulative terms of the binomial
distribution shows that the minimum value of n for which these
relationships hold is 8.

Since .8(8) = 6.4, a cutoff score of 7 correct is chosen.
Substituting these values for c and n yields

$$
.19 = \sum_{x = 0}^{6} \binom{8}{x} (.9)^{x} (.1)^{8 - x} \quad \text{and} \qquad (9)
$$

$$
.26 = \sum_{x = 7}^{8} \binom{8}{x} (.7)^{x} (.3)^{8 - x}. \qquad (10)
$$

These are the numerical solutions for the above inequalities.
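In place of a printed table, the search for the minimal test length can be sketched in Python. The function names and the upper search limit of 50 items are assumptions made for illustration; the proficiency levels, the 80% standard, and the tolerable error rates are those of the example.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

def error_rates(n, pass_percent=80, p_master=0.9, p_nonmaster=0.7):
    """Cutoff and error probabilities for an n-item test."""
    # Smallest integer at or above pass_percent% of n (integer arithmetic).
    c = (pass_percent * n + 99) // 100
    false_neg = binom_cdf(c - 1, n, p_master)          # minimal master scores below c
    false_pos = 1 - binom_cdf(c - 1, n, p_nonmaster)   # maximal nonmaster scores c or more
    return c, false_neg, false_pos

def minimal_test_length(max_false_neg=0.19, max_false_pos=0.26, n_max=50):
    """First test length satisfying inequalities (7) and (8), or None."""
    for n in range(1, n_max + 1):
        c, false_neg, false_pos = error_rates(n)
        if false_neg <= max_false_neg and false_pos <= max_false_pos:
            return n, c, false_neg, false_pos
    return None

n, c, false_neg, false_pos = minimal_test_length()
print(n, c, round(false_neg, 2), round(false_pos, 2))
```

Changing the default arguments allows other standards, proficiency levels, or tolerances to be explored in the same way.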

The conservative nature of the model results from the fact that
the calculations are based on two point values of true proficiency,
70% and 90%. The previous calculations reflect the probabilities of
false positives and false negatives, assuming that the examinee group
is composed only of people with true proficiencies of 70% and 90%.
However, if an examinee had a true proficiency of 95%, the probability
that he or she would obtain a score of less than seven correct out of
eight items, and therefore be classified as a nonmaster, may be
expressed as

$$
\sum_{x = 0}^{6} \binom{8}{x} (.95)^{x} (.05)^{8 - x} = .06. \qquad (11)
$$

This value is considerably less than the probability of a false negative
as previously obtained, .19.

On the other hand, if a person had a true proficiency equal to 60%,
the probability that he or she would obtain a score of seven or more
correct on an eight-item test, and therefore be classified as a master,
may be expressed as

$$
\sum_{x = 7}^{8} \binom{8}{x} (.6)^{x} (.4)^{8 - x} = .11. \qquad (12)
$$

This value is much less than the probability of a false positive as
previously obtained, .26.

Millman (1972) has prepared tables which allow the decisionmaker
to reach these same conclusions without calculations. His tables also
give the expected misclassification error for a variety of test lengths,
cutoff percentages, and true ability levels.

Evaluation. The binomial model actually describes the worst
possible situation. For most practical applications, the examinee
population will contain persons with true ability above the minimal
mastery level and below the maximal nonmastery level. To arrive at a
more realistic estimate of total misclassification, the equations would
have to be solved for each representative ability and be weighted by the
proportion of the group with each ability. Such a procedure is, of
course, feasible but its value is questionable. The values obtained
from the simple procedure are overly pessimistic; any decision derived
from empirical data could be no worse, and would probably be better.
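The weighting procedure mentioned above might look as follows in Python. The ability mix is purely hypothetical and chosen only for illustration; it is not taken from any data in this report. The sketch compares the weighted total with the two-point case in which half the group sits exactly at 70% and half exactly at 90%.

```python
from math import comb

def prob_pass(n, c, p):
    """P(score >= c) on an n-item test for an examinee with true proficiency p."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(c, n + 1))

n, c = 8, 7  # test length and cutoff from the example

# Hypothetical ability mix (an assumption for illustration only):
# proportion of the group at each true proficiency level.
ability_mix = {0.60: 0.25, 0.70: 0.25, 0.90: 0.25, 0.95: 0.25}

# Nonmasters (at or below .70) who pass, and masters (at or above .90) who fail.
false_pos = sum(w * prob_pass(n, c, p) for p, w in ability_mix.items() if p <= 0.70)
false_neg = sum(w * (1 - prob_pass(n, c, p)) for p, w in ability_mix.items() if p >= 0.90)
weighted_total = false_pos + false_neg

# Two-point case: half the group at .70, half at .90.
two_point_total = 0.5 * prob_pass(n, c, 0.70) + 0.5 * (1 - prob_pass(n, c, 0.90))

print(round(weighted_total, 4), round(two_point_total, 4))
```

Under this assumed mix the weighted total falls below the two-point figure, which is the sense in which the simple procedure is conservative.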

A virtue of this model is that it is relatively straightforward,
being based on the familiar binomial distribution. It is one of the
simpler quantitative models to derive test lengths and cutting scores.
The model can be criticized, however, because of its conceptual
foundations. Specifically, the output of the model tells us the
probability that a student will attain a certain test score, given his
or her true ability level. However, it is by no means clear or obvious
that the decisionmaker would know the student's true level of
functioning. Indeed, if the true ability level were known, there would
be no need for models to determine test length and cutting scores. In
using the binomial model, the decisionmaker has to set estimated (or
desired) limits on the true level of functioning of the student. This
allows him or her to infer the conditional probability of the observed
test score, given the hypothesized level(s) of proficiency. This
binomial model is most useful for initial approximations of test length
and cutting score before test data have been collected.

Bayesian Model

Assumptions and Rationale. If information can be obtained about
the quality of the examinee population (perhaps on the basis of
previous similar populations) before the test scores are observed, then
a Bayesian model may be appropriate for deriving test lengths and
cutting scores. The input consists of an estimate of the ability
distribution in the examinee population, and the conditional
probabilities that a randomly chosen item would be answered correctly
given some ability level. The output is the conditional probability
that an individual's ability equals (or, in some cases, exceeds) some
criterion ability, conditional upon his or her test score.

The Bayesian, like the binomial, model makes the following
assumptions: (1) items must be dichotomously scored, (2) responses are
independent, (3) items are equally difficult for any given examinee
within a particular ability group, and (4) there is no systematic
learning or fatigue during test taking. As in the binomial model,
ability is defined as the probability of responding correctly to a
randomly chosen item from the domain. We will continue to use the term
proficiency (p) when referring to this definition of ability.

Examples. The first model to be discussed assumes $s \ge 2$ discrete
states of mastery.

Epstein and Steinheiser (1975) developed a two-step algorithm based
on work by Hershman (1971). The first step yields the probability of
an examinee being in mastery state i, conditional on an item score:

$$
P(M_i \mid t) = \frac{p(t \mid M_i)\,p(M_i)}{\sum_{i=1}^{s} p(t \mid M_i)\,p(M_i)}, \qquad (13)
$$

where s = the number of states,

t = the item score (0 or 1),

$M_i$ = the mastery state being considered,

$p(M_i)$ = the prior probability that an individual is in mastery
state i, and

$p(t \mid M_i)$ = the probability of the score t, given the mastery state.

The second step in the procedure combines the decisions for each
item into a final probability of being in mastery state i, given the
total test score:

$$
P(M_i \mid T) =
\frac{\dfrac{\prod_{j=1}^{n} p(M_i \mid t_j)}{p(M_i)^{\,n-1}}}
     {\sum_{i=1}^{s} \dfrac{\prod_{j=1}^{n} p(M_i \mid t_j)}{p(M_i)^{\,n-1}}}, \qquad (14)
$$

where

j = 1, 2, . . . , n indexes the items (n = the number of items) and

T = the total test score.

For example, consider the case previously described for the
binomial model. Two mastery states are assumed, minimal mastery and
maximal nonmastery.

For the minimal mastery state ($M_1$), $p(t_j = \text{correct (1)} \mid M_1) = .9$
and $p(t_j = \text{incorrect (0)} \mid M_1) = .1$, for all j.

For the maximal nonmastery state ($M_2$), $p(t_j = \text{correct (1)} \mid M_2) = .7$
and $p(t_j = \text{incorrect (0)} \mid M_2) = .3$.

Values must be given for the priors, $p(M_1)$ and $p(M_2)$. Their value
may be determined on the basis of past experience, or may simply
reflect the beliefs or expectations of the evaluator. Three cases will
be considered: $p(M_1) = p(M_2) = .5$; $p(M_1) = .12$, $p(M_2) = .88$; and
$p(M_1) = .62$, $p(M_2) = .38$. These correspond to little prior
information, relatively low expectations, and relatively high
expectations. The example was computed for an observed score of seven
correct on an eight-item test. The results are shown in Table 5.

For Cases 2 and 3, where prior information favored the nonmastery
and mastery states, the final decision can be made with a relatively
high degree of confidence. For the case of little prior information,
Case 1, the probabilities of misclassification are greater. The effects
on the final decision of the priors are also clear. For the equal
priors case, the weight of the observed evidence favors a mastery
decision. However, where the nonmastery state is favored in the prior
probabilities (Case 2), the evidence does not overcome the priors and
a nonmastery decision is made.
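The two-step procedure of equations (13) and (14) can be sketched in Python as follows. The function names `item_posterior` and `combine_items` are illustrative assumptions, not names used by Epstein and Steinheiser. Because the sketch carries full floating-point precision through every step, its results may differ in the second or third decimal place from hand calculations that round intermediate values.

```python
def item_posterior(priors, likelihoods):
    """Equation (13): P(M_i | t) for a single item.

    likelihoods[i] is p(t | M_i) for the observed item score t.
    """
    joint = [like * prior for like, prior in zip(likelihoods, priors)]
    total = sum(joint)
    return [value / total for value in joint]

def combine_items(priors, p_correct, responses):
    """Equation (14): P(M_i | T) from the item-level posteriors.

    p_correct[i] is the probability of a correct response in state i;
    responses is a list of item scores (1 = correct, 0 = incorrect).
    """
    products = [1.0] * len(priors)
    for t in responses:
        likelihoods = [p if t == 1 else 1 - p for p in p_correct]
        for i, post in enumerate(item_posterior(priors, likelihoods)):
            products[i] *= post
    n = len(responses)
    weights = [prod / prior ** (n - 1) for prod, prior in zip(products, priors)]
    total = sum(weights)
    return [w / total for w in weights]

p_correct = [0.9, 0.7]        # minimal mastery, maximal nonmastery
responses = [1] * 7 + [0]     # seven correct on an eight-item test
for priors in ([0.5, 0.5], [0.12, 0.88], [0.62, 0.38]):
    posterior = combine_items(priors, p_correct, responses)
    print(priors, [round(p, 3) for p in posterior])
```

Dividing by the prior raised to the power n - 1 removes the n - 1 extra copies of the prior that enter when n item-level posteriors are multiplied, so the result agrees with a single application of Bayes' theorem to the whole response pattern.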

Whereas the Epstein and Steinheiser technique seems to offer a
method for reducing the uncertainty in decisionmaking for a given
number of test items, their procedure is limited by the constraint that
only discrete mastery groups are considered. The second model to be
reviewed deals with continuous distributions of proficiency and
classifies examinees based upon the probability that their proficiency
equals or exceeds some minimal criterion. Novick and Lewis (1974)
achieve this by assuming that the distribution of examinee
proficiencies can be approximated by a member of the family of Beta
distributions. The probability of achieving any score of interest,
given the proficiency, remains binomial. The form of Bayes' Theorem is
then a probability density function of the form
$p(T \mid x) \propto p(x \mid T)\,p(T)$, where T is the proficiency and
x is the test score.

If $p(x \mid T)$ is binomial and $p(T)$ is a Beta distribution, then
$p(T \mid x)$ will also be a member of the Beta family. In fact, if the
prior distribution is Beta(a, b) (i.e., B(a, b)), and a score of x is
observed in n trials, then the posterior distribution is
B(a + x, b + n - x).

Table 5

Changes in Posterior Probability of Mastery as a Function
of Changes in Prior Probability of Mastery

                    Case 1     Case 2     Case 3

Prior
  $p(M_1)$          .5         .12*       .62
  $p(M_2)$          .5         .88        .38

Posterior
  $p(M_1 \mid T)$   .66        .205*      .767
  $p(M_2 \mid T)$   .33        .796       .242

*Computational steps:

$p(t_j = 1) = .12 \times .9 + .88 \times .7 = .724$

$p(t_j = 0) = .12 \times .1 + .88 \times .3 = .276$

$p(M_1 \mid t_j = 1) = (.12 \times .9)/.724 = .149$

$p(M_1 \mid t_j = 0) = (.12 \times .1)/.276 = .043$

$\prod_j p(M_1 \mid t_j) = .149^7\,(.043) = 7 \times 10^{-8}$

$p(M_2 \mid t_j = 1) = (.88 \times .7)/.724 = .851$

$p(M_2 \mid t_j = 0) = (.88 \times .3)/.276 = .957$

$\prod_j p(M_2 \mid t_j) = .851^7\,(.957) = .309$

$P(M_1 \mid T) = (7 \times 10^{-8}/.12^7)\,/\,[(7 \times 10^{-8}/.12^7) + (.309/.88^7)] = .205$

$P(M_2 \mid T) = (.309/.88^7)\,/\,[(7 \times 10^{-8}/.12^7) + (.309/.88^7)] = .796$

Continuing with the previous example in the continuous framework,
we shall now consider three prior distributions. Integer values of
a and b, the parameters of the Beta distribution, will be used. We
may therefore use the Incomplete Beta function $I_p(a, b)$, which has
the following relationship to the cumulative binomial distribution:

$$
\sum_{x = x'}^{n} \binom{n}{x} p^{x} (1 - p)^{n - x} = I_p(x',\, n - x' + 1), \qquad (15)
$$

where n is the number of test trials, p is the probability of success
on a randomly selected trial, and x' is the observed number of
successes.



Tabled values are available (Beyer, 1966, Table III.2). For
noninteger values of a and b, programmed numerical methods may be
required (Novick & Jackson, 1974).

For the first example, assume that little is known about the
examinee population, i.e., a randomly selected examinee may get a test
score that would place him or her in the mastery or nonmastery category
with equal probability. In terms of the Beta distribution, this means
that examinee proficiency would be rectangularly distributed, resulting
in a = 1, b = 1, or B(1, 1) (Novick & Jackson, 1974, p. 114).

For the second case, assume that the prior probability that a
randomly chosen examinee has proficiency greater than or equal to .8 is
.12, i.e., $P(p \ge .8) = .12$. Therefore, 1 - p is used to enter the
cumulative binomial table at the top (since tabled p values stop at
p = .50), and .12 is the table value.

However, we cannot use the table until one more parameter is
specified; so let us assume that the examiner's "certainty of prior
belief" can be quantified as being equivalent to the information that
would be available if a 10-item test were given (Winkler, 1972, p. 187).
With n = 10, we find that an entry with a value of .12 in the .20 column
for n = 10 has an associated x' value equal to 4. Unfortunately, x'
does not equal 4, due mainly to a limitation of the table, since p
values stop at .50 and do not extend to .80 or beyond. Note, however,
that if we let x' = 4 in the cumulative binomial, and subtract the
result from 1, we obtain

$$
1 - \sum_{x = 4}^{10} \binom{10}{x} (.2)^{x} (.8)^{10 - x},
$$

which equals 1 - .1208, or .88.

If the table had extended to p = .8, then the value .879 would have
been found as the entry corresponding to n = 10 and x' = 7. Hence, the
value for x' is 7. Substituting x' = 7 and n = 10 in equation (15), we
obtain $I_p(7, 4)$ as the Beta distribution which represents the prior
information that $P(p \ge .8) = .12$ is equivalent to 10 additional test
trials.

The third example considers that the prior probability of a
randomly chosen examinee having proficiency greater than or equal to .8
is .62, which is also comparable to information that could be obtained
from a 10-item test. Again, entering the table with n = 10, 1 - p = .2,
we find that a tabled value of .62 this time corresponds to x' = 2.
Substituting x' = 2 in the cumulative binomial and subtracting that
result from 1 yields .38. Again, an extension of the table to p = .8
would show that when n = 10, a tabled value of .38 corresponds to an
x' value of 9. Therefore, the parameters for the Beta distribution in
this case are $I_p(9, 2)$.
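The table lookup described in the two preceding examples can be imitated with a direct search. The sketch below is an assumption-laden substitute for the printed table, not the procedure of Winkler or of Novick and Jackson: for a prior worth n trials, it picks the integer x' whose implied upper-tail probability $1 - I_{.8}(x', n - x' + 1)$ lies closest to the stated prior belief.

```python
from math import comb

def upper_binomial(x_min, n, p):
    """P(X >= x_min) for X ~ Binomial(n, p); by equation (15) this equals
    the Incomplete Beta value I_p(x_min, n - x_min + 1)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(x_min, n + 1))

def beta_prior_from_belief(prob_at_least, criterion, n):
    """Return integer Beta parameters (a, b) = (x', n - x' + 1).

    x' is chosen so that the prior probability of proficiency at or above
    `criterion` is as close as possible to `prob_at_least`. The closest-match
    rule is an illustrative assumption.
    """
    def gap(x):
        return abs((1 - upper_binomial(x, n, criterion)) - prob_at_least)
    x_best = min(range(1, n + 1), key=gap)
    return x_best, n - x_best + 1

print(beta_prior_from_belief(0.12, 0.8, 10))  # second example in the text
print(beta_prior_from_belief(0.62, 0.8, 10))  # third example in the text
```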

Having thus derived the prior distributions, let us now consider
some hypothetical test scores, and then derive the posterior
distributions.

Suppose that a score of seven correct on an eight-item test were
observed. Then the posterior proficiency distributions will be B(a +
number correct, b + number of trials - number correct). For the three
examples, we therefore have B(8, 2), B(14, 5), and B(16, 3).

The posterior probability that an examinee with a score of seven
correct out of eight items has a proficiency greater than or equal to
.8 (i.e., $P(p \ge .8 \mid 7, 8)$) can be found by determining the area
in the upper tail of the appropriate Incomplete Beta function (Winkler,
1972, Table 5; Schlaifer, 1969, Table T3; Novick & Jackson, 1974, Table
A-14). For the three examples, these upper-tail values are
$1 - I_{.8}(8, 2) = .56$; $1 - I_{.8}(14, 5) = .28$; and
$1 - I_{.8}(16, 3) = .73$.

Since the origin of these values may not be intuitively obvious,
we shall outline the steps required to complete the first example, using
the Novick and Jackson tables.

Step 1: Since p > q, reverse the order, and enter the table with
p = 2 and q = 8.

Step 2: The table gives the cumulative area (of proficiency);
however, since we want to determine the area in the upper part of the
Beta function, we need to subtract the stated proficiency of .8 from
1, and thereby obtain .2. This represents the symmetric area in the
lower 20% of the distribution.

Step 3: .2 lies between the tabled values of .1796 and .2723,
with associated probabilities (fractiles) of those tabled proficiencies
equal to 50% and 75%, respectively.

Step 4: Interpolation yields the fact that a 20% or less
proficiency would occur 56% of the time; therefore, 80% or greater
proficiency should also be observed 56% of the time.

Novick and Jackson also provide a convenient set of charts (pp.
122-123) for rapid approximations, although it should be noted that for
the current example, the solution is found to be .44 from their chart
A. This value must be subtracted from 1, since the .44 represents the
cumulative area in the lower portion of the B(8, 2) curve.

If the probability of having a proficiency greater than or equal
to .8 must be at least .5 for an examinee to be classified as a master,
then a score of 7 out of 8 would lead to a mastery classification only
in the first and third examples previously described. The weight of
the low prior reversed the decision rule in the second example.
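The posterior tail areas and the resulting classifications can be checked without tables by using the binomial identity of equation (15), which holds for integer Beta parameters. The following Python sketch is illustrative only; the function name `beta_cdf_int` and the dictionary of priors are our own assumptions.

```python
from math import comb

def beta_cdf_int(p, a, b):
    """I_p(a, b) for positive integers a and b, via equation (15):
    I_p(a, b) = P(X >= a) for X ~ Binomial(a + b - 1, p)."""
    n = a + b - 1
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(a, n + 1))

priors = {"B(1, 1)": (1, 1), "B(7, 4)": (7, 4), "B(9, 2)": (9, 2)}
correct, items = 7, 8       # observed score
criterion, required = 0.8, 0.5

results = {}
for name, (a, b) in priors.items():
    # Conjugate update: B(a + number correct, b + number incorrect).
    post_a, post_b = a + correct, b + items - correct
    tail = 1 - beta_cdf_int(criterion, post_a, post_b)  # P(p >= .8 | score)
    results[name] = (post_a, post_b, tail, tail >= required)
    print(name, "->", f"B({post_a}, {post_b})", round(tail, 2),
          "master" if tail >= required else "nonmaster")
```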

For another approach to deriving prior distributions, assume that
prior information can be described as equivalent to 7 correct on a
10-item test. (This is an assumption not without criticism, as we shall
note in a subsequent section.) Assume also that proficiency is
distributed as Beta, a helpful and reasonably appropriate assumption.
The mean of the examinees' proficiency then equals x/(n + 1) or
7/11 = .636. The variance equals
$x(n - x + 1)/[(n + 1)^2 (n + 2)] = 28/1452 = .019$.
Since the parameters are integers, we may once again use the cumulative
binomial as a means of obtaining the Incomplete Beta density function:

$$
I_p(7, 4) = \sum_{x = 7}^{10} \binom{10}{x} p^{x} (1 - p)^{10 - x}
          = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)} \int_0^{p} u^{a - 1} (1 - u)^{b - 1}\,du. \qquad (16)
$$



Equation (16) is the probability that a given proficiency is less
than or equal to p. We can compute this probability by assigning
specific values to p, as shown in Table 6. The values for
P(proficiency ≤ p) up to the 50th fractile may be found directly (Beyer,
1966, Table III.2) for x' = 7 and n = 10. Values for .6 and greater can
be computed according to the cumulative binomial equation (16). When
the values obtained (as in Table 6) are plotted, the result is a smooth
ogive-like curve (Winkler, 1972, pp. 153, 186; Schlaifer, 1969, p. 438).
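The cumulative values can also be generated directly from the binomial sum rather than read from a table. The short Python sketch below is illustrative; the function name `cumulative_prior` is an assumption, and the defaults n = 10 and x' = 7 are the values used in this example.

```python
from math import comb

def cumulative_prior(p, n=10, x_min=7):
    """Cumulative binomial sum of equation (16): the probability that
    proficiency is less than or equal to p under the B(7, 4) prior."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(x_min, n + 1))

for tenth in range(1, 10):
    p = tenth / 10
    print(f"{p:.1f}  {cumulative_prior(p):.4f}")
```

Plotting these values against p gives the ogive-like curve described above.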

To plot the proficiency distribution, we may use the Beta
distribution function:

$$ f(p) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, p^{a-1} (1 - p)^{b-1} \tag{17} $$

Table 6

Cumulative Estimation of Prior Probabilities for
Various Assumed Proficiencies

                       I_p(7,4), or
p = Proficiency        P(proficiency <= p)

      .1                   .0000
      .2                   .0009
      .3                   .0106
      .4                   .0548
      .5                   .1719
      .6                   .3823
      .7                   .6496
      .8                   .8791
      .9                   .9872




Values of the proficiency (p) may be chosen, but a = x' = 7, and
b = n - x' + 1 = 4. Since Γ(n + 1) = n! for integers, we can easily
evaluate the constant of the Beta distribution function: Γ(a + b) =
Γ(11) = 10! = 3.6288 x 10^6; Γ(a) = Γ(7) = 6! = 7.2 x 10^2; Γ(b) =
Γ(4) = 3! = 6. Therefore, Γ(a + b)/[Γ(a) Γ(b)] =
3.6288 x 10^6/[(720)(6)] = 840. Table 7 shows how values of f(p) may
be obtained.
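
The arithmetic above can be reproduced as follows. This is a minimal
sketch, assuming the Beta density with integer parameters a = 7 and
b = 4; the function name `beta_density` is hypothetical.

```python
from math import factorial

a, b = 7, 4

# Gamma(a + b) / [Gamma(a) Gamma(b)] using Gamma(n + 1) = n! for integers
const = factorial(a + b - 1) // (factorial(a - 1) * factorial(b - 1))

def beta_density(p):
    """Ordinate of the prior proficiency distribution at proficiency p."""
    return const * p ** (a - 1) * (1 - p) ** (b - 1)

print("constant =", const)
for tenth in range(1, 10):
    p = tenth / 10
    print(f"{p:.1f}  {beta_density(p):.3e}")
```

Plotting the printed ordinates against p gives the prior curve described
in the text.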

A plot of the tabled values for p on the abscissa and f(p) on the
ordinate could then be made. Such plots may also be found in Winkler
(1972, sec. 4.3 and 4.4), Schlaifer (1969, sec. 11.1.2) and Novick and
Jackson (1974, p. 112). Note that this is a prior distribution of
hypothesized proficiencies in which we assumed at the outset that the
information could be characterized as comparable to the information
that would be obtained from observing a score of seven correct on a
ten-item test.

Evaluation. Bayesian models offer the possibility of enhancing
the assessment of examinee proficiency by using prior information, e.g.,
knowledge that content experts or examiners have about previous similar
examinee populations. As the validity and accuracy of this prior
information increases, fewer test items will be needed to achieve a
given level of classification accuracy in comparison to the binomial
model and in comparison to the Bayesian case of equal priors. As more
is known about the examinee population (i.e., the more that prior
information departs from a B(1, 1) distribution), the more the
variability in the posterior distribution is reduced, and the more the
number of items needed to attain a desired level of accuracy is reduced.

In comparing the binomial and Bayesian models, note that the former
produced as output the probability of observing a specific score
conditional upon some hypothesized true ability level. In the spirit of
classical hypothesis testing, one need not know anything about an
examinee's proficiency, except that he or she is more or less likely to
come from the mastery side of the cutoff score. Since some true level of
functioning must be hypothesized, it is possible to determine the
probabilities of falsely passing a nonmaster and falsely failing a
master if the test score suggests a true proficiency level either above
or below the hypothesized true level of functioning.

In contrast, the Bayesian model provides as output the probability
that a specific examinee has a true ability equal to or greater than the
criterion (minimal) ability, conditional upon the observed test score.
But since no true ability was hypothesized, false positive and false
negative error rates cannot be specified as was possible with the
binomial model. While both models give the probability that an examinee
is a member of some ability level group, the binomial estimate refers to
the probability of a score occurring conditional upon the assumed true
proficiency, whereas the Bayesian estimate refers to the probability of
a specific examinee being at or beyond some proficiency level
conditional upon his or her observed test score.

Table 7

Point Values for Prior Proficiency Distribution

Proficiency
values (p)     (p)^(a-1) (1 - p)^(b-1)     f(p) = 840 (p)^(a-1) (1 - p)^(b-1)

    .1             7.29 x 10^-7                 6.12 x 10^-4
    .2             3.28 x 10^-5                 2.75 x 10^-2
    .3             2.50 x 10^-4                 2.10 x 10^-1
    .4             8.85 x 10^-4                 7.44 x 10^-1
    .5             1.95 x 10^-3                 1.64 x 10^0
    .6             2.99 x 10^-3                 2.51 x 10^0
    .7             3.18 x 10^-3                 2.67 x 10^0
    .8             2.10 x 10^-3                 1.76 x 10^0
    .9             5.31 x 10^-4                 4.48 x 10^-1

There are several difficulties confronting the potential user of a
Bayesian model for CRT purposes. First, the mathematics can become
rather cumbersome, since the Beta distribution must be used when ability
is assumed to be distributed continuously. Second, a methodological
difficulty arises in the determination of prior probabilities (Winkler,
1972, sec. 4.8). It is methodologically unsound to merely ask the
examiner or expert to "state his priors," since simple human judgment of
probabilities is often unreliable, inconsistent, and distorted (Kaplan
& Schwartz, 1975). A method used in the present paper (equating prior
information to comparable test length and score information) may be
suitable for purposes of illustration, but it may be difficult to
implement in applied settings.

There is at present a dearth of research about how prior
probabilities can actually be obtained from experts. Perhaps a pair
comparison or forced-choice procedure could be used in which various
combinations of proficiency (or expected scores) and associated
probabilities are presented to the expert (Steinheiser, 1976). Thus, the
judge's prior distribution would be directly obtained, and the best
fitting Beta distribution used to provide the necessary parameter
values.

Rasch's One-Parameter Logistic Model 

Assumptions and Rationale. The latent trait model developed by
Rasch (1960, 1961, 1966) is claimed to yield person-free test
calibrations and item-free person measurements (Wright & Panchapakesan,
1969). The model attempts to reproduce an item by score group matrix in
which n items are ordered by their difficulties, and n - 1 score groups
are ordered by the raw scores. Cell entries represent the probability
that item i will be passed by a person in score group j (Whitely &
Dawis, 1974).

There are two parameters in the model. The first is person ability
A; the second is item difficulty D. The odds (O) of a person correctly
answering an item are equal to the product of the person's ability times
the item's difficulty: O = A x D. If we express the odds as a
probability, we find that the probability P of a person with ability A
succeeding on an item with difficulty D can be expressed as

$$ P = \frac{A \times D}{1 + A \times D} $$

Replacing A and D with their logarithms, log A = a and log D = d, we
may finally express P as a logistic function (Wright, 1967):

$$ P = \frac{e^{a + d}}{1 + e^{a + d}} \tag{18} $$
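
The step from odds to the logistic form can be illustrated with a short
sketch. It assumes the relation O = A x D stated in the text, so that
P = e^(a+d)/(1 + e^(a+d)) with a = log A and d = log D; the function
names are hypothetical. Note that, with the odds written as a product, a
larger D raises the probability of success, so D behaves here as an
easiness parameter.

```python
from math import exp, log

def probability_from_odds(A, D):
    """P = O / (1 + O), with odds O = A * D."""
    odds = A * D
    return odds / (1 + odds)

def rasch_probability(a, d):
    """Logistic form, with a = log A and d = log D."""
    return exp(a + d) / (1 + exp(a + d))

# Hypothetical parameter values, for illustration only.
A, D = 2.0, 3.0
p_odds = probability_from_odds(A, D)
p_logistic = rasch_probability(log(A), log(D))
print(p_odds, p_logistic)  # the two forms agree up to rounding error
```

When a + d = 0 the odds are even and the probability of success is .5.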



This model assumes that (1) all items measure the same
unidimensional trait; (2) all items have equal discriminating power and
vary only in difficulty (the restriction of a common discrimination
index results in a set of nonintersecting item characteristic curves
that differ only by a translation along the ability scale); (3) subjects
and items are locally independent; (4) guessing effects are minimal;
and (5) there is no time constraint on answering items (Rasch, 1966).

Tests comprised of items all of which fit the model have the
following properties (Wright & Panchapakesan, 1969; Whitely & Dawis,
1974): (1) estimates of item difficulty parameters will not differ
significantly for any sample of examinees; (2) estimates of person
ability will not differ significantly for any sample of calibrated
items; (3) individual ability estimates can be measured on at least an
interval and perhaps a ratio scale (Wright, 1967); (4) the scale of
abilities is defined regardless of the characteristics of the population
of examinees who take the test; and (5) a unique standard error of
measurement is associated with each ability level.

The significance of the Rasch logistic model can be appreciated
by comparing it to "classical" models of test development:

     A psychological test having these general characteristics
     would become directly analogous to a yardstick that measures
     the length of objects. That is, the intervals on the
     yardstick are independent of the length of the objects and the
     length of individual objects is interpretable without
     respect to which particular yardstick is used. In contrast,
     tests developed according to the classical model have neither
     characteristic. The score obtained by a person is not
     interpretable without referring to both some norm group and the
     particular test form used. ... No longer would equivalent
     forms need to be carefully developed, since measurement is
     instrument independent and any two subsets of the calibrated
     item pool could be used as alternative instruments.
     Similarly, independence of measurement from a particular
     population distribution implies that tests can be used with
     persons dissimilar from the standardization population without
     the necessity of collecting new norms (Whitely & Dawis, 1974,
     pp. 163-164).

Examples. Calibrating a test using the Rasch model results in a
logarithmic ability estimate being assigned to every possible raw score.
This estimate indicates the amount of ability required to achieve that
raw score. A comparison of the ability estimates assigned to the same
raw score by two samples with different ability distributions indicates
the degree to which the Rasch model calibrates a test independently of
the ability level of the calibration sample.

Wright (1967) studied the responses of 976 beginning law students
to 48 reading comprehension items on the L.S.A.T. To obtain samples
with different ability distributions, he selected two contrasting
groups from his total sample. The lower group included the 325 students
who did poorest on the test, with a top score of 23. The higher group
included the 303 students with the highest scores, with a bottom score
of 33. Wright compared the similarity between the two sets of Rasch
ability estimates and the two sets of percentile ranks. Figure 1 shows
the results, in terms of "person-bound test calibration," where a plot
of raw score against percentile rank clearly shows two different ability
groups. If a person is said to be in the nth percentile, reference must
be made to which group that person belongs.

After subjecting these same data to the Rasch logistic analysis,
the test scores are transformed into ability measurements along the
ordinate. Figure 2 shows that the curves for the best and worst
examinees almost completely overlap.

The difficulty estimates based upon these dichotomous examinee
groups are statistically equivalent. Therefore, these estimates are
independent of the ability of the examinees in the calibration sample,
and may be used over the entire range of ability. Comparing the
calibration curves of these figures shows the contrast between (1)
calibration based upon the ability distribution of a standardizing
sample, and (2) calibration that is free from the effects of the ability
distribution of the examinees used for the calibration.

Can ability be measured in a fashion that frees it from dependence
on the use of a fixed set of items? If a pool of test items has been
calibrated on a common scale, can any set of items be selected from that
pool to make statistically equivalent ability measurements?

Wright (1967) tested these hypotheses by making it as difficult as
possible for person measurement to be item free. He divided the
original test items into two non-overlapping subtests, the easiest items
comprising one subtest and the hardest items comprising the other
subtest. The model predicts that ability estimates based upon the easy
subtest should be statistically equivalent to those estimates based
upon the hard subtest.

The solution required converting the scores to log abilities, and
then standardizing the differences in ability estimates. First, for
each score, the corresponding log ability on the calibration curves was
obtained (see Figure 2). For each pair of scores (from the easy and
hard subtests), a pair of estimated log abilities was obtained. Then,
a standardized difference was found by dividing the difference between
the easy and hard subtest ability estimates by the measurement error
of the differences. If the ability estimates are statistically
equivalent, then the distribution of standardized differences should
have a mean equal to zero and a standard deviation equal to one. The
obtained values were .003 and 1.014, respectively.

Figure 2. "Person-free" test calibration using the Rasch logistic model
(adapted from Wright, 1967).
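
The standardization of differences between easy-subtest and hard-subtest
ability estimates can be sketched as follows. This is an illustration
only: the data are hypothetical, and the error of a difference is
assumed here to be the square root of the sum of the two squared
standard errors (that is, the two estimates are treated as having
independent errors), which the text does not spell out.

```python
from math import sqrt
from statistics import mean, pstdev

def standardized_differences(easy, hard):
    """easy, hard: per-person (log ability, standard error) pairs."""
    return [
        (a_easy - a_hard) / sqrt(se_easy**2 + se_hard**2)
        for (a_easy, se_easy), (a_hard, se_hard) in zip(easy, hard)
    ]

# Hypothetical estimates for three examinees (not Wright's data).
easy = [(1.0, 0.3), (0.2, 0.3), (-0.5, 0.4)]
hard = [(0.5, 0.4), (0.4, 0.4), (-0.5, 0.3)]

z = standardized_differences(easy, hard)
print("mean =", mean(z), "sd =", pstdev(z))
```

With a large sample, a mean near zero and a standard deviation near one
would indicate statistically equivalent ability estimates.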



Applications. A more detailed example will show how the Rasch
model was used to analyze the results of a criterion-referenced test
(Kifer & Bramble, 1974). The data were obtained from 201 college
students taking an 84-item multiple choice examination in introductory
educational psychology. After discarding items that did not fit the
model, the final test contained 68 items.

Comparison of the Rasch-derived ability estimates to a criterion
score can proceed in two ways.

The first is analogous to determining the probability of committing 
a Type I error in classical hypothesis testing. That is, if the cri- 
terion ability corresponds to the null hypothesis, we must determine the 
probability that an obtained ability could have arisen from random sam- 
pling from a distribution with a mean equal to the criterion ability and 
a standard deviation equal to the error associated with the criterion 
ability. 

The second is analogous to determining the probability of
committing a Type II error in classical hypothesis testing. That is,
given an obtained ability estimate and associated error (standard
deviation), we seek the probability that the criterion ability could
have been observed from random sampling from the distribution
corresponding to the obtained ability estimate.

Kifer and Bramble chose to define their criterion score as 80% of
the items correct or 54.4 items correct. Their cutoff score was
therefore 55. A raw score of 55 yields an ability estimate of 1.69,
with a standard error of .33. Suppose a raw score of 60 were obtained.
What is the probability that this score exceeds the criterion score of
55?

The solution requires that we find the probability that this score
is part of the criterion distribution, with mean equal to 1.69 and
standard deviation equal to .33. (1) Kifer and Bramble's parameter
estimates show that an observed score of 60 has an ability value equal
to 2.32. (2) 2.32 - 1.69 = .63 units of difference between the observed
and criterion abilities. (3) .63/.33 = 1.91 standard deviations of
difference between the ability values. (4) A table of the normal
distribution shows that 1 - F(1.91) = .03. Therefore, the ability value
of 2.32 has a probability = .03 of coming from a normal distribution
with a mean = 1.69 and standard deviation = .33.

There is a second method by which ability estimates may be
compared to mastery standards. This method requires the probability
that the criterion ability is part of the distribution which has a given
(observed) ability as its mean and the given ability standard error
as its standard deviation. We now need to find the probability that
the true ability corresponding to a score of 60 does not exceed the
criterion ability. (1) Kifer and Bramble's parameter estimates show
that an observed score of 60 has an ability value equal to 2.32 and a
standard error equal to .39. (2) 2.32 - 1.69 = .63 units of difference
between the observed and criterion abilities. (3) .63/.39 = 1.62
standard deviations of difference between the abilities. (4) A table of
the normal distribution shows that 1 - F(1.62) = .05. Therefore, the
ability value of 1.69 has a probability of .05 of coming from a normal
distribution with mean = 2.32 and standard deviation = .39. Hence, the
probability that an examinee with a score of 60 has a true ability
below the criterion value = .05, which is the Type II error analog that
the criterion score would not be obtained by chance given the obtained
ability.
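
Both comparisons reduce to a normal tail area and can be checked with
the standard library. The sketch below uses the ability values and
standard errors quoted from Kifer and Bramble in the text; the helper
`normal_cdf` and the variable names are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative standard normal distribution F(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

criterion, se_criterion = 1.69, 0.33   # raw score 55
observed, se_observed = 2.32, 0.39     # raw score 60

# First method (Type I analog): observed ability referred to the
# distribution centered on the criterion ability.
z1 = (observed - criterion) / se_criterion
p1 = 1 - normal_cdf(z1)

# Second method (Type II analog): criterion ability referred to the
# distribution centered on the observed ability.
z2 = (observed - criterion) / se_observed
p2 = 1 - normal_cdf(z2)

print(f"z1 = {z1:.2f}, p1 = {p1:.2f}")
print(f"z2 = {z2:.2f}, p2 = {p2:.2f}")
```

The two methods differ only in which standard error scales the same
.63-unit difference, which is why the second probability is larger.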


Anderson et al. (1968) investigated the hypothesis that Rasch item
easiness estimates are independent of the ability of the calibrating
sample, and that the item easiness estimates are more stable when only
items that fit the model are considered. They used the 45-item spiral
omnibus intelligence test for screening applicants to the Australian
Army or Royal Australian Navy. Samples of 6[illegible] recruit
applicants to the Citizen Military Force (CMF) and 874 recruit
applicants to the Royal Australian Navy were studied. Twelve items were
deleted for zero or for 100% correct responses.

For the CMF sample, 30 items (91%) fit the model at the .01
confidence level, and 25 items (76%) fit the model at the more stringent
.05 level of confidence. (The level of confidence represents the
probability of obtaining the observed pattern of responses, assuming
that the model is adequate to explain performance on the item.) For the
Navy sample, the corresponding findings were 22 items (67%) and 16
items (48%).

The correlation between the item easiness estimates from both
samples was .958 (based upon 33 items). When the items that failed to
fit the model at the .05 level were deleted, the correlation increased
to .990. It therefore appears that the item easiness ratios were
independent of the ability of the samples from which they were computed.
It should be critically noted that an intelligence test was used, and
that the two subject populations probably did not differ significantly.

In a more recent study, Tinsley and Dawis (1975) gave four types
of tests (verbal, numerical, picture, and item-symbol analogies) to four
groups of subjects: college students, high school students, civil
service clerks, and clients of the state Division of Vocational
Rehabilitation (DVR). If Wright's findings could be replicated, then the
ability estimates of one group should correlate highly with the ability
estimates of another group for the same test. Of the 10 correlations
that were computed (e.g., college students and high school students for
the picture test, high school students and DVR clients on verbal
analogies), all reached +.999. The invariant relationship between the
ability estimates calculated for a 25-item verbal analogies test for 630
college students and 90 DVR clients replicated the relationship reported
by Wright (1967) and shown in Figure 2. Tinsley and Dawis conclude that
"... Rasch ability estimates are invariant with respect to the ability
of the calibrating sample" (p. 337).

Tinsley and Dawis also investigated the degree to which the item
parameters (item difficulty estimates and item difficulty ratios) were
invariant when the analyses were performed on all items of the test.
The correlation of item difficulty estimates for a given test from two
examinee groups tended to be rather large (+.90). Interestingly,
correlations close to zero were obtained from the DVR group with both
high school and college students. This unexpected finding may be
attributed to the small (n = 89) sample of DVR subjects. Generally, the
item easiness ratios were invariant with respect to the ability of the
calibrating sample of examinees, even though several of the comparisons
used samples of questionable size.

Evaluation. The studies cited have demonstrated that if the
assumptions are met, or even reasonably approximated, then person-free
test calibration and item-free person measurement can be achieved by
using this one-parameter logistic model. Although Hambleton and Traub
(1973) report that a logistic model with an item discrimination index as
a second parameter provides a better fit to their data, the inclusion of
this second parameter violates true "objectivity in measurement"
(Wright, 1967).

Several potential shortcomings may pose some difficulty in
successfully implementing the model: (1) a pool of items must be
developed that conforms to this item-analysis model, and the items must
be calibrated (perhaps 20% of the items will have to be either discarded
or revised); (2) the item calibration and standardization procedures
require dozens of items and hundreds of subjects; (3) the model does not
make direct predictions about optimal test lengths or cutting scores as
do the models of Macready and Novick and Lewis; and (4) the mathematics
of the model can become quite complex, posing problems for actually
implementing the model and for interpretation of output. However, recent
publications and the availability of computer programs (Wright & Mead,
1975, 1976) alleviate this difficulty.

The major virtues of the Rasch model can be summarized as follows:
(1) Once a test has been standardized on any group of subjects, it can
be given again to a different group, without the need to create parallel
forms. For example, a test which had been developed by giving it to
"masters" could later be given to "nonmasters." (2) All abilities will
be on the same scale, regardless of the subset of items from which these
abilities were estimated. Thus, person A can be measured on a hard test,
and person B on an easy test.

Regression Theory

Assumptions and Rationale. The criterion-referenced testing
literature has tended to emphasize the supposed dichotomy between
classical test theory and the emerging CRT theory. The following
discussion of regression as a means for assessing mastery is intended to
point out the similarities between several CRT strategies and classical
theory. Specifically, both the Bayesian and logistic models produce
estimated distributions of ability, as does classical regression. A
cutoff score must still be set at some point on the ability (score)
distributions, regardless of what model is used to derive the
distributions. This section simply portrays classical regression theory
in terms of CRT theory.

The regression-theoretic approach of the "classical testing model"
(Lord & Novick, 1968) describes the reason for lack of perfect
mastery-nonmastery observed scores in terms of specified or estimated
errors of measurement. The observed score is considered to be an
unbiased estimate of an examinee's true score. It is then possible to
derive a regression function that could be used to estimate true scores
from observed scores. The equation for the regression function is

$$ R(T \mid x) = r_{xx'}\,x + (1 - r_{xx'})\,m_x \tag{19} $$

where R(T|x) = the true score T given the observed score x, $r_{xx'}$ =
the reliability of the test, and $m_x$ = the mean of the observed scores.

The magnitude of several types of error may also be determined.
The error of measurement is the error involved when, for a randomly
selected examinee, we take the observed score as an estimate of the
true score. This can be expressed as E = X - T, and the random variable
E, taking on values of e, is called the error of measurement. The
standard deviation of this error of measurement, called the standard
error of measurement, can be expressed in terms of the standard
deviation of observed scores and the reliability of the test:

$$ s_E = s_x \sqrt{1 - r_{xx'}} \tag{20} $$

The difference between the linear regression estimate and the true
score itself is called the error of estimation, and is expressed
symbolically as

$$ e = r_{xx'}(x - m_x) - (T - m_T) \tag{21} $$

where $m_T$ is the mean of the true scores. The standard deviation of
these errors, called the standard error of estimation, is expressed as

$$ s_e = s_x \sqrt{r_{xx'}(1 - r_{xx'})} \tag{22} $$
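
A minimal sketch of these quantities follows, assuming the classical
forms of the regression estimate, the standard error of measurement, and
the standard error of estimation. The function names and the numerical
values (reliability .80, observed mean 3.0, observed standard deviation
1.2) are hypothetical and chosen only for illustration.

```python
from math import sqrt

def estimated_true_score(x, reliability, mean_x):
    """Regression estimate: r * x + (1 - r) * m_x."""
    return reliability * x + (1 - reliability) * mean_x

def standard_error_of_measurement(sd_x, reliability):
    """s_E = s_x * sqrt(1 - r)."""
    return sd_x * sqrt(1 - reliability)

def standard_error_of_estimation(sd_x, reliability):
    """s_e = s_x * sqrt(r * (1 - r))."""
    return sd_x * sqrt(reliability * (1 - reliability))

reliability, mean_x, sd_x = 0.80, 3.0, 1.2   # hypothetical test statistics

t_hat = estimated_true_score(5, reliability, mean_x)
sem = standard_error_of_measurement(sd_x, reliability)
see = standard_error_of_estimation(sd_x, reliability)
print(f"estimate = {t_hat:.2f}, SEM = {sem:.3f}, SEE = {see:.3f}")
```

The estimate is pulled from the observed score toward the group mean,
and the standard error of estimation is the standard error of
measurement multiplied by the square root of the reliability.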

Example. A graphic representation of the regression technique for
a five-item test is shown in Figure 3. For each observed score, an
estimated true score is obtained from R(T|x), and the standard error of
estimation $s_e$ is calculated. A cutoff score based upon true scores
may then be specified. (In this example, a true score of 4 correct has
arbitrarily been chosen as the cutoff score.)



The output of the regression model, like that for the Rasch model,
is a set of distributions. The mean of each distribution is the value
for each R(T|x), and the common standard error for all of the
distributions is $s_e$. If the decision rule requires that all examinees
be classified as masters when the value of R(T|x) exceeds the criterion,
and that all other scores should lead to a nonmastery decision, then the
probability of misclassification can be calculated.

For persons with observed scores and estimated true scores below
the criterion value, the probability that such persons might be
misclassified as nonmasters is simply the proportion of the distribution
exceeding the criterion value. For persons with observed scores and
estimated true scores above the criterion, the probability that such
persons might be misclassified as masters is the proportion of the
distribution below the criterion.

These probabilities of misclassification are represented as dotted
and crosshatched areas, respectively, in Figure 3. If we assume that
the error of estimation is normally distributed, then the probabilities
can be readily obtained from a table of normal probabilities.
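
The misclassification probabilities described above can be sketched for
a five-item test with a true-score cutoff of 4. The reliability,
observed mean, and observed standard deviation below are hypothetical
(the report does not give them), and the error of estimation is assumed
to be normally distributed; the function names are illustrative only.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative standard normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def misclassification(x, reliability, mean_x, sd_x, cutoff):
    """Return (decision, probability that the decision is wrong) for score x."""
    estimate = reliability * x + (1 - reliability) * mean_x
    see = sd_x * sqrt(reliability * (1 - reliability))
    p_below = normal_cdf((cutoff - estimate) / see)  # area below the cutoff
    if estimate > cutoff:
        return "master", p_below         # true score may lie below the cutoff
    return "nonmaster", 1 - p_below      # true score may lie at or above it

# Hypothetical test statistics for illustration.
reliability, mean_x, sd_x, cutoff = 0.80, 3.0, 1.2, 4.0
for score in range(6):
    decision, p_error = misclassification(score, reliability, mean_x, sd_x, cutoff)
    print(score, decision, round(p_error, 3))
```

Scores whose estimated true score lies close to the cutoff carry the
largest probability of misclassification.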

Two final comments are necessary. First, this procedure uses the
standard error of estimate, rather than the standard error of
measurement; $s_e$ will always be smaller than $s_E$, since more
information is used in calculating the estimated true score with a
regression function than in estimating true score as the observed score.
Thus, there is good reason to use the estimated true scores R(T|x) in
any analysis of test data. Second, the assumption of normality becomes
important only when calculating misclassification errors. If the
standard error of estimate cannot be assumed to be normally distributed,
it may still be reported, and may prove to be useful in obtaining an
estimate of the goodness of the test.

Evaluation. The regression theory approach is not a predictive
model in the sense that the models developed by Dayton and Macready,
Emrick, Millman, and Novick are predictive of desired test lengths and
optimal cutoff scores. However, the regression approach does give
probabilistic estimates of true scores, given the observed scores. The
assumptions of normally distributed standard errors of estimate and of
equal standard errors for all abilities may also be difficult to meet,
although such departures may not pose a serious problem. And, since
this is a linear regression model, it is assumed that the regression
of true scores on observed scores is linear. This is a generally
reasonable, though perhaps overly simplistic, assumption to make.
Because the regression model has been used for many years longer than
the other models reviewed in this paper, there is a greater theoretical
and empirical literature to back it up than there is for the newer, less
established models. For a more technical critique of the use of
regression models for estimating true scores from observed scores, see
Appendix B.

SUMMARY AND CONCLUSIONS 



Nature of Performance Acquisition 

Performance acquisition is assumed to be an all-or-none phenomenon,
according to the models developed by Emrick and by Dayton and Macready
(see Table 1). Hence, these models assume that error-free test
performance is also dichotomous. But the binomial, Bayesian, logistic,
and classical regression models assume that performance acquisition is
continuous. Performance on dichotomously scored test items must
therefore be mapped onto an equivalent position on the underlying
ability continuum (Roudabush, 1974). It is not possible to decide
unequivocally that one assumption is more correct than the other, since
the nature of performance acquisition most likely interacts with the
particular type of task. Some tasks tend to elicit unitary, highly
practiced, sequential behaviors, and would seem to be performed in an
all-or-none fashion. Tasks which require multiskilled performances would
more closely approximate the assumptions of the continuous skill
acquisition models.

Measurement Error 

Measurement error is defined as the difference between observed
test score and true (unobservable) score that would be obtained if
measurement were perfect. It is most important when one tries to infer a
true "error-free" score from observed data. The Block and Crehan methods
do not estimate a true score, nor do they deal directly with measurement
error. Rather, they relate observed scores directly to an external
criterion. Hence, any systematic error will not be a problem. But random
errors which affect the consistency of observed scores will disturb the
measurement process for individual cases. Fortuitously, such errors
will tend to average out across groups of examinees, allowing
generalizations to be made which should be valid in the "long run."

The all-or-none models deal with measurement error by stipulating
values for the probability of masters committing errors and for
nonmasters guessing correctly. These values are obtained by fitting the
all-or-none models to observed data. Responses from both mastery and
nonmastery groups can be described by binomial distributions.

The "continuous" models of Novick, Rasch, and regression theory
deal with measurement error by reporting a standard error for each true
score estimate. In particular, the Rasch model provides a check on how
well the model's output approximates the observed score matrix (Wright
& Mead, 1975, 1976). "Best fit" techniques are required for the
Bayesian and regression models. The binomial models do not rely
directly on observed data, and hence do not deal directly with
measurement error. Instead, for any hypothesized level of mastery, the
models predict the observed score distribution. Adequacy of the models'
predictions can be evaluated by fitting data to the hypothesized
distributions. A more complete comparison of how these models are
affected by measurement error must await either Monte Carlo simulation
studies or considerable efforts of empirical research.



Classification Error 



Unlike measurement error, classification error refers to assigning
individuals to inappropriate mastery level groups: masters to the
nonmastery group, and nonmasters to the mastery level group. Such errors
could occur even with error-free measurement. However, measurement
error interacts with classification error, further complicating the
decisionmaking process of assigning examinees to mastery level groups.
Suppose that, because of measurement error, all estimates of true score
tended to be inflated. For a given decision rule, this would tend to
decrease false negatives and increase false positives. Unfortunately,
constant measurement error is the exception rather than the rule, making
it virtually impossible to correct for it, and therefore separate it
from classification error.

The Block and Crehan models deal with classification error
empirically by comparing the decisions based on a test score with an
external criterion. Hence, the classification error can be determined
simply by counting the number of observed misclassifications. If
examinee groups remain similar over time, these models probably provide
useful and stable estimates of misclassification error.

Because none of the other models incorporates an external criterion,
a direct measure of classification error is not possible. Instead, the
models rely on the distributional information obtained for the estimated
true scores. With this information, it is possible to predict the
probability of misclassification, given various cutoff scores. Further
empirical work which incorporates an external criterion is needed to
verify the accuracy of such predictions.

An essential ingredient of decisionmaking on the basis of CRT
scores is the concept of cost, both to the examinee and to the system
which he or she is being prepared to join. Consider the case of
professional licensing, such as for new medical doctors: with an
extremely strict criterion, many would fail, morale would be low, and
the system (society) would be deprived of much-needed medical service.
However, with a very lax criterion, more examinees would pass who may
not (unfortunately) be qualified, and society would thus suffer the
consequences of having "nonmasters" in practice. A similar case could be
made for automobile mechanics, military medics, television repairmen,
etc. Emrick's model is the only one that directly incorporates monetary
costs of incorrect classifications into its procedures. However, an
objective cost factor could also be incorporated into the other models
quite readily. But none of the models, as developed, deals with more
complex kinds of cost, such as morale, costs to society (which may have
to be measured in terms of utility, not dollars), or even the cost of
testing as opposed to not testing (Nader, 1976).
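
The following sketch illustrates how an objective cost factor might be attached to a cutting score. It is not Emrick's procedure; it simply weights the two misclassification probabilities of an all-or-none model by assumed costs and picks the cutoff with the lowest expected cost. All names and numerical values (`alpha`, `beta`, `p_master`, and the two cost arguments) are hypothetical.

```python
from math import comb

def prob_at_least(n, p, cutoff):
    """P(X >= cutoff) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(cutoff, n + 1))

def expected_cost(n, cutoff, alpha, beta, p_master, cost_false_pass, cost_false_fail):
    """Expected cost per examinee when 'pass' means score >= cutoff."""
    false_pass = prob_at_least(n, alpha, cutoff)           # nonmaster passes
    false_fail = 1 - prob_at_least(n, 1 - beta, cutoff)    # master fails
    return ((1 - p_master) * false_pass * cost_false_pass
            + p_master * false_fail * cost_false_fail)

def best_cutoff(n, alpha, beta, p_master, cost_false_pass, cost_false_fail):
    """Cutoff in 0..n+1 with the lowest expected cost (n+1 means nobody passes)."""
    return min(
        range(n + 2),
        key=lambda c: expected_cost(n, c, alpha, beta, p_master,
                                    cost_false_pass, cost_false_fail),
    )

# Hypothetical five-item test with equal and with very unequal costs.
equal_costs = best_cutoff(5, 0.25, 0.125, 0.5, 1.0, 1.0)
strict_costs = best_cutoff(5, 0.25, 0.125, 0.5, 100.0, 1.0)
print(equal_costs, strict_costs)
```

Raising the cost of passing a nonmaster pushes the preferred cutoff upward, which is the licensing trade-off described above. Costs such as morale or societal utility would still have to be expressed on a common scale before they could enter a calculation of this kind.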


Test Length 

For performance-oriented testing, where each item may require
considerable time and expense, it is essential to be able to approximate
the minimum number of items needed for good decisionmaking.

Neither the Block nor the Crehan method explicitly deals with
test length. These models were designed to show what happens when 
existing test results are compared to an external criterion. However, 
since the data are available, it would be possible to reevaluate the 
results, assuming that only some of the test items were used. The 
regression approach allows for shorter tests, but does not provide 
for extrapolation to longer tests. 

Since the binomial model does not rely on observed data, results
for tests of any length can be predicted. This aspect of the model is
particularly attractive, since a first approximation to test length can
be easily tried out.
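
A first approximation of this kind can be sketched as a search over test lengths. The code below assumes a simple binomial model in which an examinee with a given true proficiency answers each item correctly with that probability; the proficiency levels, the passing percentage, and the tolerated error rate are hypothetical inputs, not values recommended in this report.

```python
from math import comb

def pass_probability(n_items, cutoff, proficiency):
    """P(score >= cutoff) for an examinee with the given true proficiency."""
    return sum(
        comb(n_items, k) * proficiency**k * (1 - proficiency)**(n_items - k)
        for k in range(cutoff, n_items + 1)
    )

def shortest_test(p_high, p_low, pass_percent, max_error, max_items=200):
    """Smallest n for which both error rates are at most max_error.

    The cutoff is the smallest integer score that is at least
    pass_percent percent of n (integer arithmetic avoids float rounding).
    """
    for n in range(1, max_items + 1):
        cutoff = -(-pass_percent * n // 100)   # integer ceiling
        miss = 1 - pass_probability(n, cutoff, p_high)    # proficient examinee fails
        false_pass = pass_probability(n, cutoff, p_low)   # weak examinee passes
        if miss <= max_error and false_pass <= max_error:
            return n, cutoff
    return None

# Hypothetical case: proficiency .90 vs .50, 80 percent to pass,
# at most a 10 percent chance of either kind of error.
result = shortest_test(0.90, 0.50, 80, 0.10)
print(result)
```

Under these assumed inputs the search stops at a ten-item test with a cutoff of eight; tightening the tolerated error rate or narrowing the gap between the two proficiency levels lengthens the test.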

The all-or-none models use observed data to help generate the
necessary parameters. Once the values are available, it is possible to
predict the results for tests of any length. As with the Bayesian model,
such predictions will be valid only if the examinee groups remain
relatively stable.

The Bayesian models can also be used as a predictor for test
results of any test length. However, estimates of the values of several
prior probabilities must be specified. In order for the predicted
results to be applicable to real data, the estimated prior probabilities
must be close approximations to the priors as determined post hoc,
after data have been collected. The main feature of this model, reducing
test length as a function of increasing prior information, will
be minimized to the extent that the prior information departs from
correctly characterizing the proficiency of the population under
investigation.

The logistic model of Rasch can only be used to predict the results
on a test that includes items that have already been calibrated.
However, the logistic nature of the model makes it extremely powerful in
this respect, since the item difficulty values calculated as part of
the procedure are invariant across examinee groups of differing ability;
any subset of items can be used with any group of examinees.
Furthermore, the errors associated with each calibrated item are
available, which can lead to precise predictions of classification error
for tests made up of a subset of the original item pool.
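
The prediction step can be sketched with the standard one-parameter logistic form, in which the probability of a correct response depends only on the difference between person ability and item difficulty. The difficulty values and the choice of short form below are hypothetical; in practice they would come from a prior calibration.

```python
from math import exp, sqrt

def rasch_probability(ability, difficulty):
    """P(correct) under the one-parameter logistic model (both in logits)."""
    return 1.0 / (1.0 + exp(difficulty - ability))

def predicted_score(ability, difficulties):
    """Expected number correct and its standard deviation on the given items."""
    probs = [rasch_probability(ability, b) for b in difficulties]
    mean = sum(probs)
    sd = sqrt(sum(p * (1.0 - p) for p in probs))
    return mean, sd

# Hypothetical calibrated difficulties (logits) for a five-item pool,
# and a three-item short form drawn from it.
item_pool = [-1.5, -0.5, 0.0, 0.5, 1.5]
short_form = item_pool[1:4]

for ability in (-1.0, 0.0, 1.0):
    full_mean, full_sd = predicted_score(ability, item_pool)
    short_mean, short_sd = predicted_score(ability, short_form)
    print(ability, round(full_mean, 3), round(short_mean, 3))
```

Because the same difficulty values are reused for the short form, no new calibration is needed to predict scores on any subset of the pool.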



Conceptualization of Mastery

The only models that explicitly define mastery are the all-or-none
models. Deviations from perfection or total lack of skill are defined
as measurement error. Mastery is not explicitly defined in any of the
other models. Either test performance is related to some other
performance (Block and Crehan) or an estimated true score on a continuum
is provided. The models can then be used to evaluate test results on any
specified definition of mastery.

These (continuous) models require that the tester be extremely
sensitive to system requirements. If mastery is defined in terms of very
high performance, then very few examinees are likely to be classified
as masters; however, if mastery is defined in terms of less demanding
standards, the tester (and the system) runs the risk of having a
mastery group that is less than adequate. Thus, the validity of the
definition of mastery in terms of the system requirements becomes an
issue. Empirical studies are needed in specific content areas to
determine "how much ability" a master should have.



Characteristics of Items

Only the Rasch logistic model, of all the models discussed in this
paper, is designed for item analysis. Other models address, either as
assumptions or as definitions, such matters as how items are sampled,
item difficulty, item homogeneity, and item independence. Certainly, if
an item set can be shown to violate these assumptions or definitions,
the application of such a model would be questionable. Little
theoretical or empirical work has been done to demonstrate the
robustness of these models to violations of the assumptions.
APPENDIX A




A GENERALIZATION OF THE EMRICK MODEL FOR THE CASE OF
UNEQUAL PROPORTIONS OF MASTERS AND NONMASTERS

Kenneth I. Epstein¹

The phi coefficient is a legitimate measure of correlation for
data expressed as frequencies or proportions; it is not appropriate
for conditional probabilities. The entries in the table of measurement
errors proposed by Emrick and Adams (1970) and Emrick (1971a, 1971b)
are conditional probabilities. A simple numerical example illustrates
the type of problem which may occur if conditional probabilities are
used to calculate $\phi$. Assume that a group of examinees is made up
of 80% masters and 20% nonmasters, that 10% of the examinees are
masters who respond incorrectly to an item, and that 5% of the
examinees are nonmasters who respond correctly to the item. This
situation is represented in a fourfold table in Table A-1.



Table A-1

Hypothetical Response Data for
Masters and Nonmasters

                     Observed response
True state        Wrong     Correct     Total

Master             .10        .70        .80
Nonmaster          .15        .05        .20

Total              .25        .75       1.00



The phi coefficient for Table A-1 is:

$$\phi = \frac{(.70)(.15) - (.10)(.05)}{\sqrt{(.80)(.20)(.25)(.75)}} = .5774$$

The above represents a valid use of the phi coefficient.



¹ My appreciation to Dr. George Macready for pointing out the problem
and suggesting the direction of its solution.
We may now calculate $\alpha$ and $\beta$ for the above data. $\alpha$
is defined as the probability that a nonmaster responds correctly.
$\beta$ is defined as the probability that a master responds
incorrectly. For this example:

$$\alpha = .05/.20 = .250 \qquad 1 - \alpha = .750$$

$$\beta = .10/.80 = .125 \qquad 1 - \beta = .875$$

These data are represented in Table A-2.

Table A-2

Measurement Errors and Mastery State
for Hypothetical Data

                          Observed response
True state        Wrong                   Correct                 Total

Mastery           $\beta = .125$          $1 - \beta = .875$        1
Nonmastery        $1 - \alpha = .750$     $\alpha = .250$           1

Total             .875                    1.125                     2



The phi coefficient for Table A-2 is:

$$\phi = \frac{(.875)(.750) - (.125)(.250)}{\sqrt{(1)(1)(.875)(1.125)}} = .6299$$

Clearly the two calculated values of $\phi$ are not in agreement. Table
A-2 is the sort of analysis proposed by Emrick and Adams. It does not
represent a valid application of the phi coefficient.

Fortunately, one can obtain a table of proportions similar to
Table A-1 from a table of measurement errors similar to Table A-2,
simply by multiplying each entry in the mastery row of Table A-2 by the
proportion of masters, and by multiplying each entry in the nonmastery
row of Table A-2 by the proportion of nonmasters. The general form for
this relationship is represented in Table A-3.
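
The two calculations and the row-weighting step can be checked numerically. The sketch below uses only the hypothetical proportions of the example; the helper name `phi` and the table layout (rows are true state, columns are wrong and correct) are choices made for this illustration.

```python
from math import sqrt

def phi(table):
    """Phi for a 2x2 table laid out as
    [[master wrong, master correct], [nonmaster wrong, nonmaster correct]]."""
    (a, b), (c, d) = table
    rows = (a + b) * (c + d)
    cols = (a + c) * (b + d)
    return (b * c - a * d) / sqrt(rows * cols)

# Joint proportions of the example (Table A-1).
joint = [[0.10, 0.70], [0.15, 0.05]]
p_master = 0.80
p_nonmaster = 0.20

# Conditional error probabilities derived from the joint table.
beta = joint[0][0] / p_master        # master responds incorrectly
alpha = joint[1][1] / p_nonmaster    # nonmaster responds correctly
conditional = [[beta, 1 - beta], [1 - alpha, alpha]]

# Weight each row of the conditional table by its group proportion
# to recover a table of joint proportions.
rebuilt = [
    [p_master * x for x in conditional[0]],
    [p_nonmaster * x for x in conditional[1]],
]

print(round(phi(joint), 4))         # valid use of phi
print(round(phi(conditional), 4))   # computed on conditional probabilities
print(round(phi(rebuilt), 4))       # agrees with the joint table again
```

The first and third values agree, while the value computed directly on the conditional probabilities does not, which is the discrepancy the example is meant to expose.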



Table A-3

Table of Proportions for Observed Responses
and Mastery State in Terms of $\alpha$, $\beta$, $P(M)$, and $P(\bar{M})$

                               Observed response
True state     Wrong                                   Correct                                 Total

Mastery        $P(M)\beta$                             $P(M)(1 - \beta)$                       $P(M)$
Nonmastery     $P(\bar{M})(1 - \alpha)$                $P(\bar{M})\alpha$                      $P(\bar{M})$

Total          $P(M)\beta + P(\bar{M})(1 - \alpha)$    $P(\bar{M})\alpha + P(M)(1 - \beta)$    1.0



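The phi coefficient for Table A-3 is derived as follows:

$$\phi = \frac{P(M)(1 - \beta)\,P(\bar{M})(1 - \alpha) - P(M)\beta\,P(\bar{M})\alpha}{\sqrt{[P(M)\beta + P(\bar{M})(1 - \alpha)]\,[P(\bar{M})\alpha + P(M)(1 - \beta)]\,P(M)P(\bar{M})}}$$

$$= \frac{P(M)P(\bar{M})\,[(1 - \beta)(1 - \alpha) - \beta\alpha]}{\sqrt{[P(M)\beta + P(\bar{M}) - P(\bar{M})\alpha]\,[P(\bar{M})\alpha + P(M) - P(M)\beta]\,P(M)P(\bar{M})}}$$

$$= \frac{P(M)P(\bar{M})\,[1 - \beta - \alpha]}{\sqrt{[P(M)P(\bar{M})\alpha\beta + P(M)^2\beta - P(M)^2\beta^2 + P(\bar{M})^2\alpha + P(M)P(\bar{M}) - P(M)P(\bar{M})\beta - P(\bar{M})^2\alpha^2 - P(M)P(\bar{M})\alpha + P(M)P(\bar{M})\alpha\beta]\,P(M)P(\bar{M})}}$$

$$= \frac{P(M)P(\bar{M})\,[1 - \alpha - \beta]}{\sqrt{\left[2\alpha\beta + \frac{P(M)}{P(\bar{M})}\beta - \frac{P(M)}{P(\bar{M})}\beta^2 + \frac{P(\bar{M})}{P(M)}\alpha - \frac{P(\bar{M})}{P(M)}\alpha^2 + 1 - \alpha - \beta\right]\,[P(M)P(\bar{M})]^2}}$$

$$= \frac{1 - \alpha - \beta}{\sqrt{2\alpha\beta + \frac{P(M)}{P(\bar{M})}(\beta - \beta^2) + \frac{P(\bar{M})}{P(M)}(\alpha - \alpha^2) + 1 - \alpha - \beta}}$$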



Finally, we note that for the case where $P(M) = P(\bar{M})$, the
formula above reduces to the formula given by Emrick and Adams:

$$\phi = \frac{1 - \alpha - \beta}{\sqrt{1 - \alpha - \beta + 2\alpha\beta + \beta - \beta^2 + \alpha - \alpha^2}}$$

$$= \frac{1 - \alpha - \beta}{\sqrt{1 - [\alpha^2 - 2\alpha\beta + \beta^2]}}$$

$$= \frac{1 - \alpha - \beta}{\sqrt{1 - (\alpha - \beta)^2}}$$

For the example cited in the text,

$$\phi = \frac{1 - .06 - .12}{\sqrt{1 - .0036}} = \frac{.82}{.998} = .822.$$
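
As a numerical check, the sketch below codes the generalized formula in the form $\phi = (1 - \alpha - \beta) / \sqrt{1 - \alpha - \beta + 2\alpha\beta + [P(M)/P(\bar{M})](\beta - \beta^2) + [P(\bar{M})/P(M)](\alpha - \alpha^2)}$, which is the reading of the derivation assumed here, alongside the equal-proportion form of Emrick and Adams. The function names are hypothetical.

```python
from math import sqrt

def phi_general(alpha, beta, p_master):
    """Phi for unequal proportions of masters and nonmasters (assumed form)."""
    p_nonmaster = 1 - p_master
    under_root = (
        1 - alpha - beta + 2 * alpha * beta
        + (p_master / p_nonmaster) * (beta - beta**2)
        + (p_nonmaster / p_master) * (alpha - alpha**2)
    )
    return (1 - alpha - beta) / sqrt(under_root)

def phi_equal(alpha, beta):
    """Equal-proportion special case: (1 - a - b) / sqrt(1 - (a - b)^2)."""
    return (1 - alpha - beta) / sqrt(1 - (alpha - beta)**2)

# Hypothetical example of this appendix: 80% masters, alpha = .25, beta = .125.
print(round(phi_general(0.25, 0.125, 0.80), 4))

# With equal proportions the general form collapses to the special case.
print(abs(phi_general(0.06, 0.12, 0.50) - phi_equal(0.06, 0.12)) < 1e-12)
```

For the 80/20 example the general form returns the same value as the table of joint proportions, about .577, rather than the value obtained from the conditional probabilities.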

If we have a three-item test, upon substituting into equation (3),
we obtain

$$k = \frac{\log \dfrac{.12}{1 - .06} + 0}{\log \dfrac{(.06)(.12)}{(1 - .06)(1 - .12)}} = \frac{\log .128 + 0}{\log .0087} = .4339.$$
APPENDIX B



CRITIQUE OF THE SIMPLIFYING ASSUMPTIONS IN USING
REGRESSION MODELS FOR ESTIMATING TRUE SCORES 
FROM OBSERVED SCORES 



James McBride 
Army Research Institute 



Since $R(T|x)$ is not an unbiased estimator of $T$, the standard
deviation of the error of estimate $e$ is not the same as the conditional
standard deviation of the true score for a given observed score. That
is, if $e$ is an error of estimate $(\hat{T} - T)$, then
$\sigma^2(e|x) = \sigma^2(T|x) + \text{bias}^2$. Here, $\sigma^2(T|x)$
is the conditional variance of the true scores for given observed
scores, which is the distribution portrayed in Figure 3 and used for
inference to the misclassification probabilities.

However, $\sigma^2(e|x)$ (or equivalently, [illegible]) is then not the
appropriate variance unless there is no bias; that is, unless
$E(\hat{T}|T) = T$. And this latter relationship is generally not the
case. Estimation of classification error probabilities using
$\sigma^2(e)$ as the conditional variance would therefore be
inappropriate.

Linear regression of $T$ on $x$ is a convenient simplifying assumption;
but in actuality, the regression may often be nonlinear. Also, the
distribution of errors may seldom be normal, or even symmetrical; the
same holds true for the conditional distribution of $T$. In sum, the
estimation of error probabilities from simplified linear regression
models may be considerably distorted due to the above complicating
factors.